[jira] [Commented] (MESOS-3352) Problem Statement Summary for Systemd Cgroup Launcher
[ https://issues.apache.org/jira/browse/MESOS-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743885#comment-14743885 ]

James DeFelice commented on MESOS-3352:
---------------------------------------

For the proposed independent slice that will house tasks, is the name/path of that slice predictable/discoverable from within a custom executor?

> Problem Statement Summary for Systemd Cgroup Launcher
> -----------------------------------------------------
>
> Key: MESOS-3352
> URL: https://issues.apache.org/jira/browse/MESOS-3352
> Project: Mesos
> Issue Type: Task
> Reporter: Joris Van Remoortere
> Assignee: Joris Van Remoortere
> Labels: design, mesosphere, systemd
>
> There have been many reports of cgroups related issues when running Mesos on Systemd.
> Many of these issues are rooted in the manual manipulation of the cgroups filesystem by Mesos.
> This task is to describe the problem in a 1-page summary, and elaborate on the suggested 2 part solution:
> 1. Using the {{delegate=true}} flag for the slave
> 2. Implementing a Systemd launcher to run executors with tighter Systemd integration.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (MESOS-3408) Labels field of FrameworkInfo should be added into v1 mesos.proto
[ https://issues.apache.org/jira/browse/MESOS-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James DeFelice updated MESOS-3408:
----------------------------------

Labels: mesosphere (was: )

> Labels field of FrameworkInfo should be added into v1 mesos.proto
> -----------------------------------------------------------------
>
> Key: MESOS-3408
> URL: https://issues.apache.org/jira/browse/MESOS-3408
> Project: Mesos
> Issue Type: Bug
> Reporter: Qian Zhang
> Assignee: Qian Zhang
> Labels: mesosphere
> Fix For: 0.25.0
>
> In [MESOS-2841|https://issues.apache.org/jira/browse/MESOS-2841], a new field "Labels" has been added into FrameworkInfo in mesos.proto, but is missing in v1 mesos.proto.
[jira] [Commented] (MESOS-3507) As an operator, I want a way to inspect queued tasks in running schedulers
[ https://issues.apache.org/jira/browse/MESOS-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909004#comment-14909004 ]

James DeFelice commented on MESOS-3507:
---------------------------------------

I like the idea of frameworks publishing multiple URLs, perhaps labeled: maybe an endpoint message that consists of a name and a URL, with an ACL or visibility field possibly added later. Frameworks could publish multiple endpoints. This would be great for Kubernetes.

> As an operator, I want a way to inspect queued tasks in running schedulers
> --------------------------------------------------------------------------
>
> Key: MESOS-3507
> URL: https://issues.apache.org/jira/browse/MESOS-3507
> Project: Mesos
> Issue Type: Story
> Reporter: Niklas Quarfot Nielsen
>
> Currently, there is no uniform way of getting a notion of 'awaiting' tasks, i.e. expressing that a framework has more work to do. This information is useful for auto-scaling and anomaly detection systems. Schedulers tend to expose this over their own HTTP endpoints, but the formats across schedulers are most likely not compatible.
[jira] [Commented] (MESOS-3352) Problem Statement Summary for Systemd Cgroup Launcher
[ https://issues.apache.org/jira/browse/MESOS-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939795#comment-14939795 ]

James DeFelice commented on MESOS-3352:
---------------------------------------

What release is this targeting?
[jira] [Commented] (MESOS-3352) Problem Statement Summary for Systemd Cgroup Launcher
[ https://issues.apache.org/jira/browse/MESOS-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942537#comment-14942537 ]

James DeFelice commented on MESOS-3352:
---------------------------------------

Awesome. Thanks for the heads up.
[jira] [Created] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process
James DeFelice created MESOS-4065:
-------------------------------------

Summary: slave FD for ZK tcp connection leaked to executor process
Key: MESOS-4065
URL: https://issues.apache.org/jira/browse/MESOS-4065
Project: Mesos
Issue Type: Bug
Affects Versions: 0.25.0, 0.24.1
Reporter: James DeFelice

{code}
core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
root      1432 99.3  0.0 202420 12928 ?  Rsl  21:32  13:51 ./etcd-mesos-executor -log_dir=./
root      1450  0.4  0.1  38332 28752 ?  Sl   21:32   0:03 ./etcd --data-dir=etcd_data --name=etcd-1449178273 --listen-peer-urls=http://10.0.0.45:1025 --initial-advertise-peer-urls=http://10.0.0.45:1025 --listen-client-urls=http://10.0.0.45:1026 --advertise-client-urls=http://10.0.0.45:1026 --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025 --initial-cluster-state=existing
core      1651  0.0  0.0   6740   928 pts/0  S+   21:46   0:00 grep --colour=auto -e etcd
core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
etcd-meso 1432 root 10u IPv4 21973 0t0 TCP ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 (ESTABLISHED)
core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
root      1124  0.2  0.1 900496 25736 ?  Ssl  21:11   0:04 /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
core      1658  0.0  0.0   6740   832 pts/0  S+   21:46   0:00 grep --colour=auto -e slave
core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
mesos-sla 1124 root 10u IPv4 21973 0t0 TCP ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 (ESTABLISHED)
{code}

I only tested against mesos 0.24.1 and 0.25.0.
[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process
[ https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046360#comment-15046360 ]

James DeFelice commented on MESOS-4065:
---------------------------------------

Is this going to require a change to the ZooKeeper C client bindings? E.g. http://svn.apache.org/viewvc/zookeeper/branches/branch-3.5/src/c/src/zookeeper.c?view=markup somewhere around line 2203, adding O_CLOEXEC to the socket() call?
[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process
[ https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046361#comment-15046361 ]

James DeFelice commented on MESOS-4065:
---------------------------------------

Possible impact on SELinux systems? http://danwalsh.livejournal.com/53603.html?page=1
[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process
[ https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047064#comment-15047064 ]

James DeFelice commented on MESOS-4065:
---------------------------------------

https://issues.apache.org/jira/browse/ZOOKEEPER-2338
[jira] [Comment Edited] (MESOS-4120) Make DiscoveryInfo dynamically updatable
[ https://issues.apache.org/jira/browse/MESOS-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051817#comment-15051817 ]

James DeFelice edited comment on MESOS-4120 at 12/10/15 11:10 PM:
------------------------------------------------------------------

From a K8s integration perspective it's preferable that some sidecar framework component could update the task's DiscoveryInfo vs. putting all of that responsibility on the executor.

was (Author: jdef):
From a K8s integration perspective it's preferable that some sidecar framework component could update the a task's DiscoveryInfo vs. putting all of that responsibility on the executor.

> Make DiscoveryInfo dynamically updatable
> ----------------------------------------
>
> Key: MESOS-4120
> URL: https://issues.apache.org/jira/browse/MESOS-4120
> Project: Mesos
> Issue Type: Improvement
> Reporter: Sargun Dhillon
> Priority: Critical
> Labels: mesosphere
>
> K8s tasks can dynamically update what they expose to make discoverable by the cluster. Unfortunately, all DiscoveryInfo in the cluster is immutable at the time of task start.
> We would like to enable DiscoveryInfo to be dynamically updatable, so that executors can change what they're advertising based on their internal state, versus requiring DiscoveryInfo to be known prior to starting the tasks.
[jira] [Commented] (MESOS-4120) Make DiscoveryInfo dynamically updatable
[ https://issues.apache.org/jira/browse/MESOS-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051817#comment-15051817 ]

James DeFelice commented on MESOS-4120:
---------------------------------------

From a K8s integration perspective it's preferable that some sidecar framework component could update the task's DiscoveryInfo vs. putting all of that responsibility on the executor.
[jira] [Commented] (MESOS-4120) Make DiscoveryInfo dynamically updatable
[ https://issues.apache.org/jira/browse/MESOS-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054716#comment-15054716 ]

James DeFelice commented on MESOS-4120:
---------------------------------------

If there's no support for updates via sidecar, then a significant consequence (of forcing service/endpoint metadata through a task's DiscoveryInfo) is that every kubelet in the cluster will have to watch all changes to all endpoints in the cluster. This will not scale.

OTOH, if a sidecar component can make updates to DiscoveryInfo, then we can run a single instance of the sidecar (or sharded instances), and that should scale much, much better.

In k8s there's no way to attach annotations to individual endpoint addresses; instead we'd need to somehow annotate the entire Endpoints struct with key/values such that the taskId is reasonably discoverable by this sidecar. This is ugly and forces us to keep maintaining our forked version of the custom k8s endpoint-controller (which we ultimately want to drop support for).

Overall I'd prefer to advertise service discovery metadata through a different channel than a task's DiscoveryInfo.
[jira] [Commented] (MESOS-4086) Containerizer logging modularization
[ https://issues.apache.org/jira/browse/MESOS-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054825#comment-15054825 ]

James DeFelice commented on MESOS-4086:
---------------------------------------

I left this comment in the design doc too, related to the logging module API call that's invoked per executor creation:

{quote}
I'd like to see the executor labels passed in here so that frameworks can advertise additional metadata for logging modules to take advantage of. For example, one or more LOGDIRSx=... vars could instruct the logging module to monitor additional directories in the sandbox for log files.
{quote}

> Containerizer logging modularization
> ------------------------------------
>
> Key: MESOS-4086
> URL: https://issues.apache.org/jira/browse/MESOS-4086
> Project: Mesos
> Issue Type: Epic
> Components: containerization, modules
> Reporter: Joseph Wu
> Assignee: Joseph Wu
> Labels: logging, mesosphere
>
> Executors and tasks are configured (via the various containerizers) to write their output (stdout/stderr) to files ("stdout" and "stderr") on an agent's disk.
> Unlike Master/Agent logs, executor/task logs are not attached to any formal logging system, like {{glog}}. As such, there is significant scope for improvement.
> By introducing a module for logging, we can provide a common/programmatic way to access and manage executor/task logs. Modules could implement additional sinks for logs, such as:
> * to the sandbox (the status quo),
> * to syslog,
> * to journald
> This would also provide the hooks to deal with logging related problems, such as:
> * the (current) lack of log rotation,
> * searching through executor/task logs (i.e. via aggregation)
[jira] [Created] (MESOS-4548) Errors communicated to the scheduler should be associated with stable error codes.
James DeFelice created MESOS-4548:
-------------------------------------

Summary: Errors communicated to the scheduler should be associated with stable error codes.
Key: MESOS-4548
URL: https://issues.apache.org/jira/browse/MESOS-4548
Project: Mesos
Issue Type: Improvement
Reporter: James DeFelice

For example, in mesos 0.24 there was a change to the error message generated by the master when a previously removed framework attempts to re-register: https://github.com/apache/mesos/commit/8661672d80cbe3ebd05e68a6fc4167b54ea139ef

Some frameworks, rightly or not, attempt to compare the generated error string to "Completed framework attempted to re-register", which changed in mesos 0.24 to "Framework has been removed". These frameworks are now broken with respect to this aspect of their error handling, at least until they're changed to check for the new error string.

Arguably frameworks shouldn't be comparing error strings, since they're not guaranteed to remain stable across releases. However, mesos currently offers no alternative since there's no error **code** in the API.

Furthermore, with the rise of the HTTP API there's room for two classes of errors: synchronous validation errors vs. asynchronous errors. It would be ideal to have meaningful 4xx error code responses for synchronous errors as well as error codes for asynchronous errors delivered via ERROR events. These error codes would become part of a stable API that mesos would treat just like the rest of its APIs, allowing for deprecation cycles before breaking changes, or at the very least a release note indicating an immediate breaking change.

/cc [~vinodkone]
[jira] [Updated] (MESOS-4548) Errors communicated to the scheduler should be associated with stable error codes.
[ https://issues.apache.org/jira/browse/MESOS-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James DeFelice updated MESOS-4548:
----------------------------------

Affects Version/s: 0.24.0
[jira] [Updated] (MESOS-4548) Errors communicated to the scheduler should be associated with stable error codes.
[ https://issues.apache.org/jira/browse/MESOS-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James DeFelice updated MESOS-4548:
----------------------------------

Description: (unchanged except the final cc line, which now reads "/cc [~vinodkone], [~bmahler]" instead of "/cc [~vinodkone]")
[jira] [Created] (MESOS-4565) slave recovers and attempt to destroy executor's child containers, then begins rejecting task status updates
James DeFelice created MESOS-4565:
-------------------------------------

Summary: slave recovers and attempt to destroy executor's child containers, then begins rejecting task status updates
Key: MESOS-4565
URL: https://issues.apache.org/jira/browse/MESOS-4565
Project: Mesos
Issue Type: Bug
Affects Versions: 0.26.0
Reporter: James DeFelice

AFAICT the slave is doing this:
1) recovering from some kind of failure
2) checking the containers that it pulled from its state store
3) complaining about cgroup children hanging off of executor containers
4) rejecting task status updates related to the executor container, the first of which in the logs is:

{code}
E0130 02:22:21.979852 12683 slave.cpp:2963] Failed to update resources for container 1d965a20-849c-40d8-9446-27cb723220a9 of executor 'd701ab48a0c0f13_k8sm-executor' running task pod.f2dc2c43-c6f7-11e5-ad28-0ad18c5e6c7f on status update for terminal task, destroying container: Container '1d965a20-849c-40d8-9446-27cb723220a9' not found
{code}

To be fair, I don't believe that my custom executor is re-registering properly with the slave prior to attempting to send these (failing) status updates. But the slave doesn't complain about that; it complains that it can't find the **container**.

slave log here: https://gist.github.com/jdef/265663461156b7a7ed4e
[jira] [Commented] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.
[ https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127272#comment-15127272 ]

James DeFelice commented on MESOS-416:
--------------------------------------

AKA oom_score_adj?

> Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.
> -----------------------------------------------------------------------------------------
>
> Key: MESOS-416
> URL: https://issues.apache.org/jira/browse/MESOS-416
> Project: Mesos
> Issue Type: Improvement
> Reporter: Benjamin Mahler
> Labels: twitter
>
> We can adjust the /proc/<pid>/oom_adj control during master / slave startup, setting it to a low value to ensure we aren't killed first during an OOM.
> Relevant LWN article: http://lwn.net/Articles/317814/
> Also relevant: https://bugzilla.redhat.com/show_bug.cgi?id=239313
[jira] [Updated] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.
[ https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James DeFelice updated MESOS-416:
---------------------------------

Labels: mesosphere security twitter (was: mesosphere twitter)
[jira] [Commented] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.
[ https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127282#comment-15127282 ] James DeFelice commented on MESOS-416: -- FWIW kubernetes is doing this already for its important procs > Ensure master / slave do not get kernel OOM before executors, by setting > oom_adj control. > - > > Key: MESOS-416 > URL: https://issues.apache.org/jira/browse/MESOS-416 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler > Labels: mesosphere, security, twitter > > We can adjust the /proc//oom_adj control during master / slave startup, > setting it to a low value to ensure we aren't killed first during an OOM. > Relevant LWN article: http://lwn.net/Articles/317814/ > Also relevant: https://bugzilla.redhat.com/show_bug.cgi?id=239313 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
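The adjustment MESOS-416 proposes (and that the "AKA oom_score_adj?" comment alludes to) is done on Linux through the newer {{/proc/<pid>/oom_score_adj}} control, which superseded {{oom_adj}}. A minimal sketch of what a master/slave could do at startup, assuming the Linux procfs interface (function names are illustrative, not Mesos code):

```python
def read_oom_score_adj(pid="self"):
    # /proc/<pid>/oom_score_adj replaced the older oom_adj control;
    # the range is -1000 (never OOM-kill) to +1000 (kill first).
    with open(f"/proc/{pid}/oom_score_adj") as f:
        return int(f.read())

def set_oom_score_adj(score, pid="self"):
    # Lowering the score below its current value normally requires
    # CAP_SYS_RESOURCE; raising it is always allowed.
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(score))
```

A daemon wanting to outlive its executors during an OOM would call something like {{set_oom_score_adj(-500)}} early in startup, before forking any children.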
[jira] [Commented] (MESOS-4565) slave recovers and attempt to destroy executor's child containers, then begins rejecting task status updates
[ https://issues.apache.org/jira/browse/MESOS-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131778#comment-15131778 ] James DeFelice commented on MESOS-4565: --- To be clear the custom executor in this case is using the native containerizer, not the docker one. > slave recovers and attempt to destroy executor's child containers, then > begins rejecting task status updates > > > Key: MESOS-4565 > URL: https://issues.apache.org/jira/browse/MESOS-4565 > Project: Mesos > Issue Type: Bug > Components: docker >Affects Versions: 0.26.0 >Reporter: James DeFelice > Labels: mesosphere > > AFAICT the slave is doing this: > 1) recovering from some kind of failure > 2) checking the containers that it pulled from its state store > 3) complaining about cgroup children hanging off of executor containers > 4) rejecting task status updates related to the executor container, the first > of which in the logs is: > {code} > E0130 02:22:21.979852 12683 slave.cpp:2963] Failed to update resources for > container 1d965a20-849c-40d8-9446-27cb723220a9 of executor > 'd701ab48a0c0f13_k8sm-executor' running task > pod.f2dc2c43-c6f7-11e5-ad28-0ad18c5e6c7f on status update for terminal task, > destroying container: Container '1d965a20-849c-40d8-9446-27cb723220a9' not > found > {code} > To be fair, I don't believe that my custom executor is re-registering > properly with the slave prior to attempting to send these (failing) status > updates. But the slave doesn't complain about that .. it complains that it > can't find the **container**. > slave log here: > https://gist.github.com/jdef/265663461156b7a7ed4e -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.
[ https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025484#comment-16025484 ] James DeFelice commented on MESOS-7492: --- aren't {{poll_interval}} and {{initial_delay}} baked into {{CheckInfo}} already? > Introduce a daemon manager in the agent. > > > Key: MESOS-7492 > URL: https://issues.apache.org/jira/browse/MESOS-7492 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu > > Once we have standalone container support from the containerizer, we should > consider adding a daemon manager inside the agent. It'll be like 'monit', > 'upstart' or 'systemd', but with very limited functionalities. For instance, > as a start, the manager will simply always restart the daemons if the daemon > fails. It'll also try to cleanup unknown daemons. > This feature will be used to manage CSI plugin containers on the agent. > The daemon manager should have an interface allowing operators to "register" > a daemon with a name and a config of the daemon. The daemon manager is > responsible for restarting the daemon if it crashes until some one explicitly > "unregister" it. Some simple backoff and health check functionality should be > provided. > We probably need a small design doc for this. > {code} > message DaemonConfig { > optional ContainerInfo container; > optional CommandInfo command; > optional uint32 poll_interval; > optional uint32 initial_delay; > optional CheckInfo check; // For health check. > } > class DaemonManager > { > public: > Future register( > const ContainerID& containerId, > const DaemonConfig& config; > Future unregister(const ContainerID& containerId); > Future> ps(); > Future status(const ContainerID& containerId); > }; > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking
James DeFelice created MESOS-7605: - Summary: UCR doesn't isolate uts namespace w/ host networking Key: MESOS-7605 URL: https://issues.apache.org/jira/browse/MESOS-7605 Project: Mesos Issue Type: Improvement Components: containerization Reporter: James DeFelice Docker's {{run}} command supports a {{--hostname}} parameter which impacts container isolation, even in {{host}} network mode: (via https://docs.docker.com/engine/reference/run/) {quote} Even in host network mode a container has its own UTS namespace by default. As such --hostname is allowed in host network mode and will only change the hostname inside the container. Similar to --hostname, the --add-host, --dns, --dns-search, and --dns-option options can be used in host network mode. {quote} I see no evidence that UCR offers a similar isolation capability. Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was initially added to support the Docker containerizer's use of the {{--hostname}} Docker {{run}} flag. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6556) Hostname support for the network/cni isolator.
[ https://issues.apache.org/jira/browse/MESOS-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034991#comment-16034991 ] James DeFelice commented on MESOS-6556: --- {{hostname}} is only applied when there are container networks present. When using host-mode networking, the UTS namespace is not isolated and {{hostname}} is not applied to the container. Tracking via https://issues.apache.org/jira/browse/MESOS-7605 > Hostname support for the network/cni isolator. > -- > > Key: MESOS-6556 > URL: https://issues.apache.org/jira/browse/MESOS-6556 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Fix For: 1.2.0 > > > -Add a {{namespace/uts}} isolator for doing UTS namespace isolation without > using the CNI isolator.- > Update the {{network/cni}} isolator to set the hostname specified by the task > info. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7620) GET_VOLUMES call referenced in API docs, but the call doesn't exist
James DeFelice created MESOS-7620: - Summary: GET_VOLUMES call referenced in API docs, but the call doesn't exist Key: MESOS-7620 URL: https://issues.apache.org/jira/browse/MESOS-7620 Project: Mesos Issue Type: Bug Reporter: James DeFelice https://github.com/apache/mesos/blob/d624255394b864ed477838e32f9712d7e63fc86f/include/mesos/v1/master/master.proto#L150 {code} // Create persistent volumes on reserved resources. The request is forwarded // asynchronously to the Mesos agent where the reserved resources are located. // That asynchronous message may not be delivered or creating the volumes at // the agent might fail. Volume creation can be verified by sending a // `GET_VOLUMES` call. {code} It's either a documentation bug, or a missing/overlooked feature. /cc [~vinodkone] [~jieyu] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7314) Add offer operations for converting disk resources
[ https://issues.apache.org/jira/browse/MESOS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043082#comment-16043082 ] James DeFelice commented on MESOS-7314: --- Does this preserve reservations across conversion ops? If not, it probably should... > Add offer operations for converting disk resources > -- > > Key: MESOS-7314 > URL: https://issues.apache.org/jira/browse/MESOS-7314 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > One should be able to convert {{RAW}} and {{BLOCK}} disk resources into > different types by applying operations to them. The offer operations and the > related validation and resource handling need to be implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
James DeFelice created MESOS-7697: - Summary: Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions Key: MESOS-7697 URL: https://issues.apache.org/jira/browse/MESOS-7697 Project: Mesos Issue Type: Bug Reporter: James DeFelice Returning a 404 error for a known temporary condition is confusing from a client's perspective. A client wants to know how to recover from various error conditions. A 404 error condition should be distinct from a "server is not yet ready, but will be shortly" condition (which should probably be reported as a 503 "unavailable" error). https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593 {code} if (response->code == process::http::Status::NOT_FOUND) { // This could happen if the master libprocess process has not yet set up // HTTP routes. LOG(WARNING) << "Received '" << response->status << "' (" << response->body << ") for " << call.type(); return; } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
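The distinction the ticket asks for matters most on the client side: a 503 is unambiguously "retry later", while a 404 from a freshly elected master (per the quoted scheduler.cpp comment) is only *sometimes* retryable. A hedged sketch of one possible client policy (the names and categories here are illustrative, not part of any Mesos client library):

```python
RETRYABLE = {503}        # "service unavailable" -- unambiguous retry signal
LEGACY_RETRYABLE = {404} # masters may briefly 404 before HTTP routes exist

def classify(status, allow_legacy=True):
    """Map an HTTP status from the scheduler endpoint to a client action.

    allow_legacy controls whether a 404 is tolerated as the 'routes not
    yet set up' transient described in MESOS-7697, or treated as fatal.
    """
    if status in RETRYABLE:
        return "retry"
    if allow_legacy and status in LEGACY_RETRYABLE:
        return "retry-with-caution"
    if 200 <= status < 300:
        return "ok"
    return "fail"
```

With a 503 for the not-yet-ready case, the {{allow_legacy}} workaround (and its risk of retrying forever against a genuinely wrong URL) would be unnecessary.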
[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.
[ https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063092#comment-16063092 ] James DeFelice commented on MESOS-6638: --- https://github.com/mesos/mesos-go/pull/304#issuecomment-311060962 SUPPRESS and REVIVE were already members of the scheduler Call.Type prior to 1.2; now that they have corresponding message object types, it's unclear if the message objects are actually required if there's no role to set. The docs are unclear on this > Update Suppress and Revive to be per-role. > -- > > Key: MESOS-6638 > URL: https://issues.apache.org/jira/browse/MESOS-6638 > Project: Mesos > Issue Type: Task > Components: scheduler api >Reporter: Benjamin Mahler >Assignee: Guangya Liu > Fix For: 1.2.0 > > > The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. > Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role > the operation is being applied to. > {{Revive}} and {{Suppress}} messages do not currently exist, so these need to > be added. To support the old-style schedulers, we will make the role fields > optional. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
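To make the ambiguity concrete: a v1 scheduler {{SUPPRESS}} call can carry an optional message object naming roles, or omit it entirely for the pre-1.2 "all roles" behavior. A sketch of building the call body, assuming the JSON field naming of the v1 scheduler API and the {{repeated string roles}} shape that replaced the single {{role}} field (per MESOS-7976); the helper name is hypothetical:

```python
import json

def suppress_call(framework_id, roles=None):
    # Base call envelope: every v1 scheduler call carries the
    # framework_id and a type discriminator.
    call = {"framework_id": {"value": framework_id}, "type": "SUPPRESS"}
    if roles:
        # Per-role suppression (1.3.x+): only these roles stop
        # receiving offers.
        call["suppress"] = {"roles": list(roles)}
    # With no roles given, the `suppress` message is omitted and the
    # master suppresses offers for all of the framework's roles --
    # the old-style behavior the comment asks about.
    return json.dumps(call)
```

This matches the reading in the linked mesos-go discussion: the message object only needs to be present when there is a role (or roles) to scope the operation to.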
[jira] [Updated] (MESOS-7211) Document SUPPRESS HTTP call
[ https://issues.apache.org/jira/browse/MESOS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-7211: -- Labels: mesosphere newbie (was: newbie) > Document SUPPRESS HTTP call > --- > > Key: MESOS-7211 > URL: https://issues.apache.org/jira/browse/MESOS-7211 > Project: Mesos > Issue Type: Documentation > Components: documentation >Affects Versions: 1.1.0 >Reporter: Bruce Merry >Priority: Minor > Labels: mesosphere, newbie > > The documentation at > http://mesos.apache.org/documentation/latest/scheduler-http-api/ doesn't list > the SUPPRESS call as one of the call types, but it does seem to be > implemented. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7734) More consistent/strict validation of role names please
James DeFelice created MESOS-7734: - Summary: More consistent/strict validation of role names please Key: MESOS-7734 URL: https://issues.apache.org/jira/browse/MESOS-7734 Project: Mesos Issue Type: Improvement Reporter: James DeFelice As per the currently implemented role validation rules: https://github.com/apache/mesos/blob/63e08146aa7aa8efac3928922b6cdef92aa1d2ce/src/common/roles.cpp#L71 ... the following role names are allowed: {code} eng- eng* *eng eng. eng.. ... ..eng {code} The `/` character has good validation semantics around it and it's a much less confusing character to use when composing hierarchical role names. The `*`, `.`, and `-` characters have specific validation rules within narrow context, but it's too easy for someone to compose confusing role names using these characters. IMO, validation should severely restrict the context in which the `*` and `.` characters are used, as well as implement a symmetrical "endWith" check for `-` characters. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7675) Isolate network ports.
[ https://issues.apache.org/jira/browse/MESOS-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068462#comment-16068462 ] James DeFelice commented on MESOS-7675: --- Would this monitor only the network ports advertised as `ports` resources? Wondering about interaction with ephemeral ports. > Isolate network ports. > -- > > Key: MESOS-7675 > URL: https://issues.apache.org/jira/browse/MESOS-7675 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > If a task uses network ports, there is no isolator that can enforce that it > only listens on the ports that it has resources for. Implement a ports > isolator that can limit tasks to listen only on allocated TCP ports. > Roughly, the algorithm for this follows what standard tools like {{lsof}} and > {{ss}} do. > * Find all the listening TCP sockets (using netlink) > * Index the sockets by their node (from the netlink information) > * Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} > links) > * For each open socket, check whether its node (given in the link target) in > the set of listen sockets that we scanned > * If the socket is a listening socket and the corresponding PID is in the > task, send a resource limitation for the task > Matching pids to tasks depends on using cgroup isolation, otherwise we would > have to build a full process tree, which would be nice to avoid. > Scanning all the open sockets can be avoided by using the {{net_cls}} > isolator with kernel + libnl3 patches to publish the socket classid when we > find the listening socket. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
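The scanning algorithm in the ticket (listen sockets indexed by inode, then {{/proc/\*/fd/\*}} links matched against that index) is the same one {{lsof}} and {{ss}} use. A minimal Linux-only sketch, substituting a {{/proc/net/tcp}} parse where the ticket proposes netlink (netlink's {{sock_diag}} yields the same state and inode data):

```python
import os
import re

def listening_inodes():
    """Collect socket inodes in TCP_LISTEN state (st field == '0A')."""
    inodes = set()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header row
                for line in f:
                    fields = line.split()
                    # fields[3] is the socket state, fields[9] its inode
                    if fields[3] == "0A":
                        inodes.add(int(fields[9]))
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled
    return inodes

def pids_listening():
    """Map listen-socket inodes back to pids via /proc/<pid>/fd links."""
    listen = listening_inodes()
    pids = set()
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            for fd in os.listdir(f"/proc/{pid}/fd"):
                target = os.readlink(f"/proc/{pid}/fd/{fd}")
                m = re.match(r"socket:\[(\d+)\]", target)
                if m and int(m.group(1)) in listen:
                    pids.add(int(pid))
        except OSError:
            continue  # process exited, or fd dir not readable by us
    return pids
```

As to the ephemeral-ports question in the comment: this scan only establishes *which* pids hold listening sockets; deciding whether a given port is covered by the task's {{ports}} resources (or is an unadvertised ephemeral bind) is a separate policy check applied afterward.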
[jira] [Updated] (MESOS-7752) Command executor still active after terminal task state update.
[ https://issues.apache.org/jira/browse/MESOS-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-7752: -- Labels: mesosphere (was: ) > Command executor still active after terminal task state update. > --- > > Key: MESOS-7752 > URL: https://issues.apache.org/jira/browse/MESOS-7752 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.3.0 >Reporter: A. Dukhovniy > Labels: mesosphere > > Here is a rather simple scenario to reproduce this error: > * Frameworks starts a task with taskId = _task1_ > * Framework kills _task1_ *successfully* and *acknowledges* TASK_KILLED > * Framework starts another task with the same _task1_ but receives > "_TASK_FAILED (Attempted to run multiple tasks using a "command" executor)_" > *Note*: this test is racy so this scenario fails occasionally. > *Here is a full log* from that show a life-cycle of a task id > _app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c_: > {code:java} > # Starting... > WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 > 10:51:14.476085 14666 master.cpp:3352] Authorizing framework principal > 'principal' to launch task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c > WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 > 10:51:14.510136 14666 master.cpp:4426] Launching task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c > of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- (marathon) at > scheduler-6dbbac16-7355-4a33-aee6-b9697c83e51c@127.0.1.1:61567 with > resources... 
> WARN [10:51:14 ResidentTaskIntegrationTest-MesosAgent-32788] I0703 > 10:51:14.513908 14697 slave.cpp:2118] Queued task > 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c' > for executor > 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c' > of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- > WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 > 10:51:15.011696 14671 master.cpp:6222] Forwarding status update TASK_RUNNING > (UUID: ed2d0475-9d83-4e09-9f54-5b4d323e4558) for task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c > of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- > WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 > 10:51:15.036391 14671 master.cpp:5092] Processing ACKNOWLEDGE call > ed2d0475-9d83-4e09-9f54-5b4d323e4558 for task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c > of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- (marathon) at > scheduler-6dbbac16-7355-4a33-aee6-b9697c83e51c@127.0.1.1:61567 on agent > 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0 > {code} > {code:java} > # Killing... 
> DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] WARN > [10:51:15 KillAction$] Killing known task > [app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c] > of instance instance > [app-restart-resident-app-with-five-instances.marathon-8882bd16-5fdd-11e7-a00e-0242aceef95c] > WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703 > 10:51:15.196702 14697 slave.cpp:3816] Handling status update TASK_KILLED > (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c > of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- from > executor(1)@172.16.10.121:35184 > WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703 > 10:51:15.197676 14697 slave.cpp:4166] Sending acknowledgement for status > update TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c > of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- to > executor(1)@172.16.10.121:35184 > WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 > 10:51:15.198299 14671 master.cpp:6154] Status update TASK_KILLED (UUID: > f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c > of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- from agent > 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0 at slave(1)@172.16.10.121:32788 > (172.16.10.121) > DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] INFO > [10:51:15 MarathonScheduler] Received status update for task > app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c: > TASK_KILLED (Command terminated with signal Terminated) > WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 > 10:51:15.216081 14671 master.cp
[jira] [Updated] (MESOS-7171) Mesos Containerizer Change Size of SHM
[ https://issues.apache.org/jira/browse/MESOS-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-7171: -- Labels: mesosphere (was: ) > Mesos Containerizer Change Size of SHM > -- > > Key: MESOS-7171 > URL: https://issues.apache.org/jira/browse/MESOS-7171 > Project: Mesos > Issue Type: Improvement >Reporter: Miguel Bernadin >Assignee: Joseph Wu >Priority: Minor > Labels: mesosphere > > like the ability to adjust the size of the shared memory device just like > this can be performed on docker. > For example: To be able to change this on docker you can specify how much > space you would like to allocate as a parameter in the app definition in > marathon. > {code} > "parameters": [ > { > "key": "shm-size", > "value": "256mb" > } > {code} > As you can see below, here is an example of a container running and how much > space is available on disk reflecting this change. > Modified Parameter Container: > {code} > { > "id": "/ubuntu-withshm", > "cmd": "sleep 1000\n", > "cpus": 1, > "mem": 128, > "disk": 0, > "instances": 1, > "container": { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "ubuntu", > "network": "HOST", > "privileged": false, > "parameters": [ > { > "key": "shm-size", > "value": "256mb" > } > ], > "forcePullImage": false > } > }, > "portDefinitions": [ > { > "port": 10005, > "protocol": "tcp", > "labels": {} > } > ] > } > {code} > Modified Parameter Container: > {code} > core@ip-10-0-0-19 ~ $ docker exec -it a818cf2277a5 bash > root@ip-10-0-0-19:/# df -h > Filesystem Size Used Avail Use% Mounted on > overlay 37G 2.0G 33G 6% / > tmpfs 7.4G 0 7.4G 0% /dev > tmpfs 7.4G 0 7.4G 0% /sys/fs/cgroup > /dev/xvdb37G 2.0G 33G 6% /etc/hostname > shm 256M 0 256M 0% /dev/shm > {code} > Standard Container: > {code} > { > "id": "/ubuntu-withoutshm", > "cmd": "sleep 1", > "cpus": 1, > "mem": 128, > "disk": 0, > "instances": 1, > "container": { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": 
"ubuntu", > "network": "HOST", > "privileged": false, > "parameters": [], > "forcePullImage": false > } > }, > "portDefinitions": [ > { > "port": 10006, > "protocol": "tcp", > "labels": {} > } > ] > } > {code} > Standard Container: > {code} > root@ip-10-0-0-19:/# exit > exit > core@ip-10-0-0-19 ~ $ docker exec -it c85433062e78 bash > root@ip-10-0-0-19:/# df -h > Filesystem Size Used Avail Use% Mounted on > overlay 37G 2.0G 33G 6% / > tmpfs 7.4G 0 7.4G 0% /dev > tmpfs 7.4G 0 7.4G 0% /sys/fs/cgroup > /dev/xvdb37G 2.0G 33G 6% /etc/hostname > shm 64M 0 64M 0% /dev/shm > {code} > How can this be done on mesos containerizer? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking
[ https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082706#comment-16082706 ] James DeFelice commented on MESOS-7605: --- Re-opening this ticket for further discussion. If there are no container networks, there is no UTS namespace isolation, as per: https://github.com/apache/mesos/blob/9b69c09310cdb6d7cfca1284f60c3f1b422c77cc/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L655 without such isolation, calls to `sethostname` from a container will impact the host netns, as per: https://linux.die.net/man/2/sethostname and https://linux.die.net/man/1/unshare {quote} UTS namespace setting hostname, domainname will not affect rest of the system (CLONE_NEWUTS flag), {quote} This is distinctly different from the Docker experience. It also implies that it's impossible to give a container permission to **bind** to a host network port without also giving it permission to **change the host's network name**. This feels like a security hole to me. > UCR doesn't isolate uts namespace w/ host networking > > > Key: MESOS-7605 > URL: https://issues.apache.org/jira/browse/MESOS-7605 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: James DeFelice > Labels: mesosphere > > Docker's {{run}} command supports a {{--hostname}} parameter which impacts > container isolation, even in {{host}} network mode: (via > https://docs.docker.com/engine/reference/run/) > {quote} > Even in host network mode a container has its own UTS namespace by default. > As such --hostname is allowed in host network mode and will only change the > hostname inside the container. Similar to --hostname, the --add-host, --dns, > --dns-search, and --dns-option options can be used in host network mode. > {quote} > I see no evidence that UCR offers a similar isolation capability. 
> Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was > initially added to support the Docker containerizer's use of the > {{--hostname}} Docker {{run}} flag. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7839) io switchboard: clarify expected behavior when using TTYInfo with the default executor of a TaskGroup
James DeFelice created MESOS-7839: - Summary: io switchboard: clarify expected behavior when using TTYInfo with the default executor of a TaskGroup Key: MESOS-7839 URL: https://issues.apache.org/jira/browse/MESOS-7839 Project: Mesos Issue Type: Bug Affects Versions: 1.2.1 Reporter: James DeFelice I executed a LaunchGroup operation with an Executor of a DEFAULT type and with TTYInfo set to a non-empty protobuf. The tasks of the group did not specify a ContainerInfo. Mesos "successfully" launched the task group and in the executor sandbox stderr reported {code} The io switchboard server failed: Failed redirecting stdout: Input/output error {code} ... which seems relatively uninformative. Mesos also returned TASK_RUNNING followed by TASK_FINISHED for the tasks in the launched group. This wasn't what I expected: my goal was to launch a pod and have a TTY attached to the first task in the group. After discussing with [~klueska] the solution to my problem was to specify TTYInfo for the container of the task within the group, not on the group's executor. But we agreed that Mesos could probably exhibit better behavior in the initial scenario that I tested. Some (mutually exclusive) possibilities for alternate Mesos behavior: (a) fail-fast: using the Default Executor with a task group doesn't support TTYInfo so Mesos should just refuse to launch the task group (and return an appropriate error code and message w/ a reasonable explanation). (b) support TTYInfo when using the Default Executor with a task group. The use case for this is unclear. (c) when using TTYInfo with the DefaultExecutor and task group, attach the TTY to the first task in the group. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.
[ https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104551#comment-16104551 ] James DeFelice commented on MESOS-7492: --- Could we start with an even more minimal design and (a) get rid of the health check fields (poll_interval, initial_delay, and check), and (b) eliminate the auto-restart feature? We can add these later if/when needed as requirements and user stories evolve. There's lots of supervision tooling to choose from already and it's not clear to me that Mesos should spend the time reinventing this wheel right now. Also, supporting run-once daemon tasks actually supports **both** run-once and run-forever models (run-forever tasks just need **some** supervisor process above the actual service -- that supervisor doesn't need to be Mesos). > Introduce a daemon manager in the agent. > > > Key: MESOS-7492 > URL: https://issues.apache.org/jira/browse/MESOS-7492 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Joseph Wu > Labels: mesosphere, storage > > Once we have standalone container support from the containerizer, we should > consider adding a daemon manager inside the agent. It'll be like 'monit', > 'upstart' or 'systemd', but with very limited functionalities. For instance, > as a start, the manager will simply always restart the daemons if the daemon > fails. It'll also try to cleanup unknown daemons. > This feature will be used to manage CSI plugin containers on the agent. > The daemon manager should have an interface allowing operators to "register" > a daemon with a name and a config of the daemon. The daemon manager is > responsible for restarting the daemon if it crashes until some one explicitly > "unregister" it. Some simple backoff and health check functionality should be > provided. > We probably need a small design doc for this. 
> {code} > message DaemonConfig { > optional ContainerInfo container; > optional CommandInfo command; > optional uint32 poll_interval; > optional uint32 initial_delay; > optional CheckInfo check; // For health check. > } > class DaemonManager > { > public: > Future register( > const ContainerID& containerId, > const DaemonConfig& config; > Future unregister(const ContainerID& containerId); > Future> ps(); > Future status(const ContainerID& containerId); > }; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.
[ https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104557#comment-16104557 ] James DeFelice commented on MESOS-7492: --- Instead of health checks/auto-restart I'd actually like to see a way to adjust the "kill" signal that an agent will send to a daemon in order to shut it down. Especially if we want to support containerizing the various supervision systems that already exist in the wild (s6, systemd, etc). > Introduce a daemon manager in the agent. > > > Key: MESOS-7492 > URL: https://issues.apache.org/jira/browse/MESOS-7492 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Joseph Wu > Labels: mesosphere, storage > > Once we have standalone container support from the containerizer, we should > consider adding a daemon manager inside the agent. It'll be like 'monit', > 'upstart' or 'systemd', but with very limited functionalities. For instance, > as a start, the manager will simply always restart the daemons if the daemon > fails. It'll also try to cleanup unknown daemons. > This feature will be used to manage CSI plugin containers on the agent. > The daemon manager should have an interface allowing operators to "register" > a daemon with a name and a config of the daemon. The daemon manager is > responsible for restarting the daemon if it crashes until some one explicitly > "unregister" it. Some simple backoff and health check functionality should be > provided. > We probably need a small design doc for this. > {code} > message DaemonConfig { > optional ContainerInfo container; > optional CommandInfo command; > optional uint32 poll_interval; > optional uint32 initial_delay; > optional CheckInfo check; // For health check. 
> } > class DaemonManager > { > public: > Future register( > const ContainerID& containerId, > const DaemonConfig& config; > Future unregister(const ContainerID& containerId); > Future> ps(); > Future status(const ContainerID& containerId); > }; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
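The two comments above argue for a pared-down manager: no health checks, no auto-restart, but a per-daemon stop signal so that containerized supervisors (systemd, s6, ...) can be shut down on their own terms. A hedged process-level sketch of that reduced interface; class and method names mirror the quoted design doc but are purely illustrative, not Mesos APIs, and real daemons would be standalone containers rather than child processes:

```python
import signal
import subprocess

class DaemonManager:
    def __init__(self):
        self._daemons = {}  # daemon_id -> (process, stop signal)

    def register(self, daemon_id, argv, stop_signal=signal.SIGTERM):
        # No poll_interval/initial_delay/check and no restart loop,
        # per the "even more minimal design" suggestion.
        proc = subprocess.Popen(argv)
        self._daemons[daemon_id] = (proc, stop_signal)
        return proc.pid

    def unregister(self, daemon_id, timeout=10):
        proc, sig = self._daemons.pop(daemon_id)
        proc.send_signal(sig)  # configurable, e.g. SIGRTMIN+3 for systemd
        try:
            return proc.wait(timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # last-resort SIGKILL after the grace period
            return proc.wait()

    def ps(self):
        return [d for d, (p, _) in self._daemons.items() if p.poll() is None]
```

A run-forever daemon under this model simply registers a supervisor as its entrypoint, which covers both the run-once and run-forever cases the comment describes.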
[jira] [Created] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call
James DeFelice created MESOS-7974: - Summary: Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call Key: MESOS-7974 URL: https://issues.apache.org/jira/browse/MESOS-7974 Project: Mesos Issue Type: Bug Affects Versions: 1.2.1 Reporter: James DeFelice The agent operator API supports "application/recordio" for things like attach-container-output, which streams objects back to the caller. I expected the master operator API SUBSCRIBE call to work the same way, w/ Accept/Content-Type headers for "recordio" and Message-Accept/Message-Content-Type headers for json (or protobuf). This was not the case. Looking again at the master operator API documentation, SUBSCRIBE docs illustrate usage of Accept and Content-Type headers for the "application/json" type. Not a "recordio" type. So my experience, as per the docs, seems expected. However, this is counter-intuitive since the whole point of adding the new Message-prefixed headers was to help callers consistently request (and differentiate) streaming responses from non-streaming responses in the v1 API. Please fix the master operator API implementation to also support the Message-prefixed headers w/ Accept/Content-Type set to "recordio". Observed on ubuntu w/ mesos package version 1.2.1-2.0.1 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
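The header combination the ticket expected the master's SUBSCRIBE to honor, mirroring the agent's attach-container-output: the response envelope negotiated via {{Accept}} is RecordIO, and the framed messages inside it are negotiated via the {{Message-Accept}} header. A small sketch of that header set (the helper name is illustrative):

```python
def streaming_call_headers(message_type="application/json"):
    """Headers for a v1 streaming call per the Message-prefixed scheme.

    The request body itself is a single json/protobuf Call (plain
    Content-Type); only the *response* is RecordIO-framed.
    """
    return {
        "Content-Type": message_type,          # encoding of the Call body
        "Accept": "application/recordio",      # response envelope: RecordIO
        "Message-Accept": message_type,        # encoding of each framed message
    }
```

Per the ticket, the agent accepts this combination for its streaming calls while the master's SUBSCRIBE (as documented) expects plain {{application/json}}, which is the inconsistency being reported.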
[jira] [Commented] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call
[ https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164930#comment-16164930 ] James DeFelice commented on MESOS-7974: --- xref MESOS-6936 > Accept "application/recordio" type is rejected for master operator API > SUBSCRIBE call > - > > Key: MESOS-7974 > URL: https://issues.apache.org/jira/browse/MESOS-7974 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.1 >Reporter: James DeFelice > Labels: mesosphere > > The agent operator API supports "application/recordio" for things like > attach-container-output, which streams objects back to the caller. I expected > the master operator API SUBSCRIBE call to work the same way, w/ > Accept/Content-Type headers for "recordio" and > Message-Accept/Message-Content-Type headers for json (or protobuf). This was > not the case. > Looking again at the master operator API documentation, SUBSCRIBE docs > illustrate the use of Accept and Content-Type headers for the "application/json" > type. Not a "recordio" type. So my experience, as per the docs, seems > expected. However, this is counter-intuitive since the whole point of adding > the new Message-prefixed headers was to help callers consistently request > (and differentiate) streaming responses from non-streaming responses in the > v1 API. > Please fix the master operator API implementation to also support the > Message-prefixed headers w/ Accept/Content-Type set to "recordio". > Observed on ubuntu w/ mesos package version 1.2.1-2.0.1 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7976) v1/scheduler: Revive/Suppress role field should be marked experimental for 1.2.x branch
James DeFelice created MESOS-7976: - Summary: v1/scheduler: Revive/Suppress role field should be marked experimental for 1.2.x branch Key: MESOS-7976 URL: https://issues.apache.org/jira/browse/MESOS-7976 Project: Mesos Issue Type: Bug Affects Versions: 1.2.2, 1.2.1, 1.2.0 Reporter: James DeFelice Assignee: Benjamin Mahler The role field of the v1 scheduler API's Revive and Suppress calls should have been marked as experimental since it was part of the experimental MULTI-ROLE feature. The field has been replaced in 1.3.x by a "repeated string roles" field. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
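The replacement described above, sketched as a protobuf fragment (field numbers and message layout are illustrative, not copied from the v1 scheduler.proto):

{code}
message Suppress {
  // 1.2.x: experimental single-role field (part of MULTI_ROLE).
  optional string role = 1 [deprecated = true];

  // 1.3.x: replacement that supports suppressing multiple roles.
  repeated string roles = 2;
}
{code}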
[jira] [Commented] (MESOS-8060) Introduce first class 'profile' for disk resources.
[ https://issues.apache.org/jira/browse/MESOS-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197488#comment-16197488 ] James DeFelice commented on MESOS-8060: --- https://reviews.apache.org/r/62820/ > Introduce first class 'profile' for disk resources. > --- > > Key: MESOS-8060 > URL: https://issues.apache.org/jira/browse/MESOS-8060 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Jie Yu > > This is similar to storage classes. Instead of adding a bunch of storage > backend-specific parameters (e.g., rotational, type, speed, etc.) into the > disk resources and asking the frameworks to make scheduling decisions based > on those vendor-specific parameters, we propose to use a level of indirection > here. > The operator will set up mappings from a profile name to a set of > vendor-specific disk parameters. The framework will do disk selection based on > profile names. > The storage resource provider will provide a hook allowing operators to > customize the profile name assignment for disk resources. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1
[ https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-8078: -- Labels: mesosphere (was: ) > Some fields went missing with no replacement in api/v1 > -- > > Key: MESOS-8078 > URL: https://issues.apache.org/jira/browse/MESOS-8078 > Project: Mesos > Issue Type: Story > Components: HTTP API >Reporter: Dmitrii Rozhkov > Labels: mesosphere > > Hi friends, > These fields are available via state.json but went missing in the v1 of > the API: > leader_info > start_time > elected_time > As we're showing them on the Overview page of the DC/OS UI, but would like > to stop using state.json, it would be great to have them somewhere in V1. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking
James DeFelice created MESOS-8169: - Summary: master validation incorrectly rejects slaves, buggy executorID checking Key: MESOS-8169 URL: https://issues.apache.org/jira/browse/MESOS-8169 Project: Mesos Issue Type: Bug Affects Versions: 1.4.0 Reporter: James DeFelice Priority: Major proposed fix: https://github.com/apache/mesos/pull/248 I observed this in my environment, where I had two frameworks that used the same ExecutorID and then triggered a master failover. The master refuses to reregister the slave because it's not considering the owning-framework of the ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) that there's an erroneous duplicate executor ID: {code} W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: Executor has a duplicate ExecutorID 'default' {code} (yes, "default" is probably a terrible name for an ExecutorID - that's a separate discussion!) /cc [~neilc] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
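The gist of the proposed fix, as a standalone sketch (simplified types — the real patch operates on the re-registration validation in master.cpp): executor IDs only need to be unique *within* a framework, so the uniqueness key must be the (FrameworkID, ExecutorID) pair rather than the ExecutorID alone.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Each element is a (frameworkId, executorId) pair reported by the agent.
using FrameworkExecutor = std::pair<std::string, std::string>;

// Buggy check: flags a duplicate whenever any two executors share an ID,
// even when they belong to different frameworks.
bool hasDuplicateIdOnly(const std::vector<FrameworkExecutor>& executors) {
  std::set<std::string> seen;
  for (const auto& e : executors) {
    if (!seen.insert(e.second).second) return true;
  }
  return false;
}

// Fixed check: a duplicate only exists within a single framework.
bool hasDuplicatePerFramework(const std::vector<FrameworkExecutor>& executors) {
  std::set<FrameworkExecutor> seen;
  for (const auto& e : executors) {
    if (!seen.insert(e).second) return true;
  }
  return false;
}
```

Two frameworks both using the ID 'default' (the scenario in the report) trip the first check but pass the second.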
[jira] [Commented] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking
[ https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237738#comment-16237738 ] James DeFelice commented on MESOS-8169: --- /cc [~jamespeach] > master validation incorrectly rejects slaves, buggy executorID checking > --- > > Key: MESOS-8169 > URL: https://issues.apache.org/jira/browse/MESOS-8169 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > proposed fix: https://github.com/apache/mesos/pull/248 > I observed this in my environment, where I had two frameworks that used the > same ExecutorID and then triggered a master failover. The master refuses to > reregister the slave because it's not considering the owning-framework of the > ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) > that there's an erroneous duplicate executor ID: > {code} > W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of > agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: > Executor has a duplicate ExecutorID 'default' > {code} > (yes, "default" is probably a terrible name for an ExecutorID - that's a > separate discussion!) > /cc [~neilc] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8171) Using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop
[ https://issues.apache.org/jira/browse/MESOS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-8171: -- Labels: mesosphere (was: ) > Using a failoverTimeout of 0 with Mesos native scheduler client can result in > infinite subscribe loop > - > > Key: MESOS-8171 > URL: https://issues.apache.org/jira/browse/MESOS-8171 > Project: Mesos > Issue Type: Bug > Components: c++ api, java api, scheduler driver >Affects Versions: 1.1.3, 1.2.2, 1.3.1, 1.4.0 >Reporter: Tim Harper >Priority: Minor > Labels: mesosphere > > Over the past year, the Marathon team has been plagued with an issue that > hits our CI builds periodically in which the scheduler driver enters a tight > loop, sending 10,000s of SUBSCRIBE calls to the master per second. I turned > on debug logging for the client and the server, and it pointed to an issue > with the {{doReliableRegistration}} method in sched.cpp. Here's the logs: > {code} > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on > 127.0.1.1:60957 with 8 worker threads > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.151785 13791 group.cpp:341] Group process > (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size > (joins, cancels, datas) = (0, 0, 0) > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in > ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.152586 13791 group.cpp:758] 
Found non-sequence node 'log_replicas' > at '/mesos' in ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0') > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.152762 13791 group.cpp:700] Trying to get > '/mesos/json.info_00' in ZooKeeper > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master > (UPID=master@172.16.10.95:32856) is detected > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157347 13787 sched.cpp:336] New master detected at > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to > register without authentication > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 
EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159622 13788 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159658 13788 sched.cpp:869] Will retry registration in 0ns if > necessary > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159749 13789 sched.cpp:836] Sending SUBSCRIBE call to > master@172.16.10.95:32856 > WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 > 05:39:39.159785 13789 sched.cpp:869] Will ret
[jira] [Created] (MESOS-8237) Mesos injects Resource.allocation_info for all resource offers, regardless of MULTI_ROLE opt-in
James DeFelice created MESOS-8237: - Summary: Mesos injects Resource.allocation_info for all resource offers, regardless of MULTI_ROLE opt-in Key: MESOS-8237 URL: https://issues.apache.org/jira/browse/MESOS-8237 Project: Mesos Issue Type: Bug Reporter: James DeFelice Assignee: Benjamin Mahler In support of MULTI_ROLE capable frameworks, a Resource.allocation_info field was added and the Resource math of the Mesos library was updated to check for matching allocation_info when checking for (in)equality, addability, subtractability, containment, etc. To compensate for these changes, the demo frameworks of Mesos were updated to set the allocation_info for Resource objects during the "matching phase" in which offers' resources are evaluated in order for the framework to launch tasks. The Mesos demo frameworks NEEDED to be updated because the Resource algebra within Mesos now depended on matching allocation_info fields of Resource objects when executing algebraic operations. See https://github.com/apache/mesos/commit/c20744a9976b5e83698e9c6062218abb4d2e6b25#diff-298cc6a77862b7ff3422cd06c215ef28R91 . This poses a unique problem for **external** libraries that both aim to support various frameworks, some that DO and some that DO NOT opt-in to the MULTI_ROLE capability; specifically those external libraries that implement Resource algebra that's consistent with what Mesos implements internally. One such example of a library is mesos-go, though there are undoubtedly others. The problem can be explained via this scenario: {quote} Flo's mesos-go framework is running well, it doesn't opt-in to MULTI_ROLE because it doesn't need multiple roles. His framework runs on a version of Mesos that existed prior to integration of MULTI_ROLE support. His DC operator upgrades the mesos cluster to the latest version. Flo rebuilds his framework on the latest version of mesos-go and re-launches it on the cluster. He observes that his framework receives offers, but rejects ALL of them. 
Digging into the code he realizes that Mesos is injecting allocation_info into Resource objects being offered to his framework, and mesos-go considers allocation_info when comparing Resource objects (because it's MULTI_ROLE compatible now), but his framework doesn't take this into consideration when preparing its own Resource objects prior to the "resource matching phase". The consequence is that Flo's framework is trying to match against Resources that will never align because his framework isn't setting an allocation_info that might possibly match the allocation_info that Mesos is always injecting - regardless of the MULTI_ROLE capability (or lack thereof in this case) of his framework. {quote} If Mesos were to strip the allocation_info from Resource objects, prior to offering them to non-multi-role frameworks, then the problem illustrated above would go away. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
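The comparison failure described above reduces to a few lines (a simplified model, not the real Resource classes of Mesos or mesos-go):

```cpp
#include <cassert>
#include <string>

// Stand-in for mesos::Resource with only the fields that matter here.
struct Resource {
  std::string name;
  double scalar;
  std::string allocationRole;  // models Resource.allocation_info.role
};

// MULTI_ROLE-aware equality: allocation_info participates in the match,
// so a resource with an injected role never equals one without it.
bool operator==(const Resource& a, const Resource& b) {
  return a.name == b.name && a.scalar == b.scalar &&
         a.allocationRole == b.allocationRole;
}

// The remedy suggested in the ticket: strip allocation_info before
// offering resources to a framework without the MULTI_ROLE capability.
Resource stripAllocationInfo(Resource r) {
  r.allocationRole.clear();
  return r;
}
```

With stripping in place, a non-MULTI_ROLE framework's role-less Resource templates match the offered resources again.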
[jira] [Updated] (MESOS-3601) Formalize all headers and metadata for HTTP API Event Stream
[ https://issues.apache.org/jira/browse/MESOS-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-3601: -- Labels: api http mesosphere wireprotocol (was: api http wireprotocol) > Formalize all headers and metadata for HTTP API Event Stream > > > Key: MESOS-3601 > URL: https://issues.apache.org/jira/browse/MESOS-3601 > Project: Mesos > Issue Type: Improvement >Affects Versions: 0.24.0 > Environment: Mesos 0.24.0 >Reporter: Ben Whitehead > Labels: api, http, mesosphere, wireprotocol > > From an HTTP standpoint the current set of headers returned when connecting > to the HTTP scheduler API are insufficient. > {code:title=current headers} > HTTP/1.1 200 OK > Transfer-Encoding: chunked > Date: Wed, 30 Sep 2015 21:07:16 GMT > Content-Type: application/json > {code} > Since the response from mesos is intended to function as a stream, > {{Connection: keep-alive}} should be specified so that the connection can > remain open. > If RecordIO is going to be applied to the messages, the headers should > include the information necessary for a client to be able to detect RecordIO > and set up its response handlers appropriately. > How RecordIO is expressed will come down to the semantics of what is actually > "Returned" as the response from {{POST /api/v1/scheduler}}. > h4. Proposal > One approach would be to leverage http as much as possible, having a client > specify an {{Accept-Encoding}} along with the {{Accept}} header to indicate > that it can handle RecordIO {{Content-Encoding}} of {{Content-Type}} > messages. 
(This approach allows for things like gzip to be woven in fairly > easily in the future) > For this approach I would expect the following: > {code:title=Request} > POST /api/v1/scheduler HTTP/1.1 > Host: localhost:5050 > Accept: application/x-protobuf > Accept-Encoding: recordio > Content-Type: application/x-protobuf > Content-Length: 35 > User-Agent: RxNetty Client > {code} > {code:title=Response} > HTTP/1.1 200 OK > Connection: keep-alive > Transfer-Encoding: chunked > Content-Type: application/x-protobuf > Content-Encoding: recordio > Cache-Control: no-transform > {code} > When Content-Encoding is used it is recommended to set {{Cache-Control: > no-transform}} to signal to any proxies that no transformation should be > applied to the content encoding [Section 14.11 RFC > 2616|http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3601) Formalize all headers and metadata for HTTP API Event Stream
[ https://issues.apache.org/jira/browse/MESOS-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655531#comment-15655531 ] James DeFelice commented on MESOS-3601: --- what about something like this? {code} Content-type: application/json; streamFormat=record-io {code} (a) maintains backwards compat; and (b) provides a way for clients to determine if it's a series/stream of json objects (and how they're packaged) vs a single object (which would not include a streamFormat parameter). There's already the concept of a "stream" in the API (via the Mesos-Stream-Id header) > Formalize all headers and metadata for HTTP API Event Stream > > > Key: MESOS-3601 > URL: https://issues.apache.org/jira/browse/MESOS-3601 > Project: Mesos > Issue Type: Improvement >Affects Versions: 0.24.0 > Environment: Mesos 0.24.0 >Reporter: Ben Whitehead > Labels: api, http, mesosphere, wireprotocol > > From an HTTP standpoint the current set of headers returned when connecting > to the HTTP scheduler API are insufficient. > {code:title=current headers} > HTTP/1.1 200 OK > Transfer-Encoding: chunked > Date: Wed, 30 Sep 2015 21:07:16 GMT > Content-Type: application/json > {code} > Since the response from mesos is intended to function as a stream, > {{Connection: keep-alive}} should be specified so that the connection can > remain open. > If RecordIO is going to be applied to the messages, the headers should > include the information necessary for a client to be able to detect RecordIO > and set up its response handlers appropriately. > How RecordIO is expressed will come down to the semantics of what is actually > "Returned" as the response from {{POST /api/v1/scheduler}}. > h4. Proposal > One approach would be to leverage http as much as possible, having a client > specify an {{Accept-Encoding}} along with the {{Accept}} header to indicate > that it can handle RecordIO {{Content-Encoding}} of {{Content-Type}} > messages. 
(This approach allows for things like gzip to be woven in fairly > easily in the future) > For this approach I would expect the following: > {code:title=Request} > POST /api/v1/scheduler HTTP/1.1 > Host: localhost:5050 > Accept: application/x-protobuf > Accept-Encoding: recordio > Content-Type: application/x-protobuf > Content-Length: 35 > User-Agent: RxNetty Client > {code} > {code:title=Response} > HTTP/1.1 200 OK > Connection: keep-alive > Transfer-Encoding: chunked > Content-Type: application/x-protobuf > Content-Encoding: recordio > Cache-Control: no-transform > {code} > When Content-Encoding is used it is recommended to set {{Cache-Control: > no-transform}} to signal to any proxies that no transformation should be > applied to the content encoding [Section 14.11 RFC > 2616|http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6595) As a Mesos user I want to launch processes that will run on every node in the cluster
James DeFelice created MESOS-6595: - Summary: As a Mesos user I want to launch processes that will run on every node in the cluster Key: MESOS-6595 URL: https://issues.apache.org/jira/browse/MESOS-6595 Project: Mesos Issue Type: Story Reporter: James DeFelice Some applicable use cases: - log collection - metrics and monitoring - service discovery It might also be useful to break this functionality down into: daemon processes for master nodes vs. daemon processes for agent nodes. There was some initial discussion and back-of-the-napkin design for this at Mesoscon this past year (with an emphasis on agent nodes) but I'm not aware that anything significant materialized from that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6595) As a Mesos user I want to launch processes that will run on every node in the cluster
[ https://issues.apache.org/jira/browse/MESOS-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15668408#comment-15668408 ] James DeFelice commented on MESOS-6595: --- Marathon users have been asking for this feature for .. years: https://github.com/mesosphere/marathon/issues/846 > As a Mesos user I want to launch processes that will run on every node in the > cluster > - > > Key: MESOS-6595 > URL: https://issues.apache.org/jira/browse/MESOS-6595 > Project: Mesos > Issue Type: Story >Reporter: James DeFelice > Labels: mesosphere > > Some applicable use cases: > - log collection > - metrics and monitoring > - service discovery > It might also be useful to break this functionality down into: daemon > processes for master nodes vs. daemon processes for agent nodes. > There was some initial discussion and back-of-the-napkin design for this at > Mesoscon this past year (with an emphasis on agent nodes) but I'm not aware > that anything significant materialized from that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325794#comment-14325794 ] James DeFelice commented on MESOS-1571: --- In the kubernetes-mesos framework, the executor Shutdown() implementation currently force-stops the containers it's managing (which, to my understanding, sends SIGKILL). It manages Docker containers, which are normally given 10s to shut down gracefully before Docker sends a SIGKILL. That 10s timeout is not compatible with the default slave flag `executor_shutdown_grace_period` value of mesos (3s). However, if I change the value of that timeout to 20s to give the executor more time to gracefully kill things, there's no way for the executor to reason about that because it has no idea of how much time it actually has. As a workaround I've considered looking up the slave PID from the environment and querying its state.json for the startup flags, and trying to make a decision based on that. That approach seems somewhat hackish and I'd much rather do something nicer. It would be great to have an environment var `MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD` or something, provided by the slave containerizer, so that the executor can make a decision about whether to send (via Docker) a TERM (and wait 10s) or KILL signal. > Signal escalation timeout is not configurable > - > > Key: MESOS-1571 > URL: https://issues.apache.org/jira/browse/MESOS-1571 > Project: Mesos > Issue Type: Bug >Reporter: Niklas Quarfot Nielsen >Assignee: Alexander Rukletsov > > Even though the executor shutdown grace period is set to a larger interval, > the signal escalation timeout will still be 3 seconds. It should either be > configurable or dependent on EXECUTOR_SHUTDOWN_GRACE_PERIOD. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
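The decision the commenter wants the executor to be able to make, sketched under the assumption that the slave exported its grace period in the suggested (then-hypothetical) environment variable:

```cpp
#include <cassert>
#include <cstdlib>

// Returns the signal the executor should ask Docker to deliver: SIGTERM
// only if the slave's grace period leaves room for Docker's own
// TERM -> KILL escalation window, otherwise go straight to SIGKILL.
int chooseShutdownSignal(double dockerStopTimeoutSecs) {
  const char* raw = std::getenv("MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD");
  double grace = (raw != nullptr) ? std::atof(raw) : 3.0;  // slave default: 3s
  return grace > dockerStopTimeoutSecs ? 15 /* SIGTERM */ : 9 /* SIGKILL */;
}
```

With the 3s default a graceful 10s Docker stop never fits, which is exactly the mismatch described in the comment.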
[jira] [Commented] (MESOS-2407) libprocess segfaults when using GLOG_v=2
[ https://issues.apache.org/jira/browse/MESOS-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339001#comment-14339001 ] James DeFelice commented on MESOS-2407: --- gmtime isn't thread-safe either. should use gmtime_r (http://linux.die.net/man/3/gmtime_r). Other projects have had the same problem, e.g. http://sourceforge.net/p/levent/bugs/_discuss/thread/465dad36/948e/attachment/0001-Avoid-use-of-non-threadsafe-locale-dependent-strftim.patch > libprocess segfaults when using GLOG_v=2 > > > Key: MESOS-2407 > URL: https://issues.apache.org/jira/browse/MESOS-2407 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone >Priority: Blocker > > Found this while debugging MESOS-2403. Looks like a thread safety issue with > stream operator in Process::resume(). > {code} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from MasterAllocatorTest/0, where TypeParam = > mesos::internal::master::allocator::MesosAllocator mesos::internal::master::allocator::DRFSorter> > > [ RUN ] MasterAllocatorTest/0.FrameworkReregistersFirst > I0226 00:48:28.159126 29931 process.cpp:2150] Spawned process > files@10.35.255.108:46621 > I0226 00:48:28.159194 29958 process.cpp:2160] Resuming > files@10.35.255.108:46621 at 2015-02-26 00:48:28.159184896+00:00 > I0226 00:48:28.159333 29958 process.cpp:2160] Resuming > help@10.35.255.108:46621 at 2015-02-26 00:48:28.159284992+00:00 > I0226 00:48:28.159343 29931 process.cpp:2150] Spawned process > hierarchical-allocator(25)@10.35.255.108:46621 > I0226 00:48:28.159418 29955 process.cpp:2160] Resuming > hierarchical-allocator(25)@10.35.255.108:46621 at 2015-02-26 > 00:48:28.159364864+00:00 > Using temporary directory > '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_J8P9UO' > I0226 00:48:28.165838 29970 process.cpp:2117] Dropped / Lost event for PID: > hierarchical-allocator(22)@10.35.255.108:46621 > I0226 00:48:28.193131 29964 process.cpp:2160] Resuming > 
reaper(1)@10.35.255.108:46621 at 2015-02-26 00:48:28.193116928+00:00 > I0226 00:48:28.267730 29931 leveldb.cpp:176] Opened db in 107.694932ms > I0226 00:48:28.281376 29931 leveldb.cpp:183] Compacted db in 13.598726ms > I0226 00:48:28.281435 29931 leveldb.cpp:198] Created db iterator in 10363ns > I0226 00:48:28.281461 29931 leveldb.cpp:204] Seeked to beginning of db in > 1180ns > I0226 00:48:28.281491 29931 leveldb.cpp:273] Iterated through 0 keys in the > db in 328ns > I0226 00:48:28.281518 29931 replica.cpp:744] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0226 00:48:28.281559 29931 process.cpp:2150] Spawned process > log-replica(25)@10.35.255.108:46621 > I0226 00:48:28.281648 29959 process.cpp:2160] Resuming > log-replica(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.281634048+00:00 > I0226 00:48:28.281716 29967 process.cpp:2160] Resuming > (257)@10.35.255.108:46621 at 2015-02-26 00:48:28.281709056+00:00 > I0226 00:48:28.281654 29931 process.cpp:2150] Spawned process > (257)@10.35.255.108:46621 > I0226 00:48:28.281837 29931 process.cpp:2150] Spawned process > log(25)@10.35.255.108:46621 > I0226 00:48:28.281843 29962 process.cpp:2160] Resuming > log(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.281829120+00:00 > I0226 00:48:28.282060 29931 process.cpp:2150] Spawned process > log-reader(25)@10.35.255.108:46621 > I0226 00:48:28.282080 29948 process.cpp:2160] Resuming > log-reader(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282066944+00:00 > I0226 00:48:28.282142 29962 process.cpp:2150] Spawned process > log-recover(25)@10.35.255.108:46621 > I0226 00:48:28.282181 29962 process.cpp:2160] Resuming > log-writer(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282177024+00:00 > I0226 00:48:28.282162 29952 process.cpp:2160] Resuming > __gc__@10.35.255.108:46621 at 2015-02-26 00:48:28.282151936+00:00 > I0226 00:48:28.282162 29958 process.cpp:2160] Resuming > log-recover(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282151936+00:00 > 
I0226 00:48:28.282378 29958 recover.cpp:449] Starting replica recovery > I0226 00:48:28.282438 29954 process.cpp:2160] Resuming > log-replica(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282430976+00:00 > I0226 00:48:28.282512 29931 process.cpp:2150] Spawned process > log-writer(25)@10.35.255.108:46621 > I0226 00:48:28.282533 29968 process.cpp:2160] Resuming > log-recover(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282522880+00:00 > I0226 00:48:28.282591 29968 recover.cpp:475] Replica is in 4 status > I0226 00:48:28.282618 29950 process.cpp:2160] Resuming > metrics@10.35.255.108:46621 at 2015-02-26 00:48:28.282608128+00:00 > I0226 00:48:28.282716 29963 process.cpp:2160] Resuming > (258)@10.35.255.108:46621 at 2015-02-26 00:48:28.282684928+00:00 > I0226 00:48:28.28
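The thread-safety fix suggested in the comment above, in isolation: gmtime_r writes into a caller-supplied buffer instead of the static one shared by every call to gmtime, so concurrent logging threads cannot stomp on each other's result.

```cpp
#include <cassert>
#include <ctime>

// Thread-safe UTC conversion: each caller owns its own std::tm.
std::tm toUtc(std::time_t t) {
  std::tm result{};
  gmtime_r(&t, &result);  // POSIX; localtime_r is the analogous fix
  return result;
}
```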
[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec
[ https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356934#comment-14356934 ] James DeFelice commented on MESOS-2162: --- FWIW the kubernetes team is also considering rocket support: https://github.com/GoogleCloudPlatform/kubernetes/issues/2725 > Consider a C++ implementation of CoreOS AppContainer spec > - > > Key: MESOS-2162 > URL: https://issues.apache.org/jira/browse/MESOS-2162 > Project: Mesos > Issue Type: Story > Components: containerization >Reporter: Dominic Hamon > Labels: mesosphere, twitter > > CoreOS have released a > [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md] > for a container abstraction as an alternative to Docker. They have also > released a reference implementation, [rocket|https://coreos.com/blog/rocket/]. > We should consider a C++ implementation of the specification to have parity > with the community and then use this implementation for our containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2499) SOURCE_EXECUTOR not set properly in slave.cpp
James DeFelice created MESOS-2499: - Summary: SOURCE_EXECUTOR not set properly in slave.cpp Key: MESOS-2499 URL: https://issues.apache.org/jira/browse/MESOS-2499 Project: Mesos Issue Type: Bug Components: slave Reporter: James DeFelice Slave::statusUpdate attempts to set the Source of the TaskStatus to either SOURCE_SLAVE or SOURCE_EXECUTOR: https://github.com/apache/mesos/blob/0.21.0/src/slave/slave.cpp#L {code} TaskStatus status = update.status(); status.set_source(pid == UPID() ? TaskStatus::SOURCE_SLAVE : TaskStatus::SOURCE_EXECUTOR); {code} Unfortunately it makes this change to a copy of the TaskStatus that's later discarded, and the change is never saved to the parent StatusUpdate. This results in slave-forward()ed status updates that lack a proper source value. The likely fix is that the set_source() update should be invoked on a TaskStatus* that's obtained via StatusUpdate.mutable_status() so that the new source is saved properly. It's not clear to me if StatusUpdate should be obtained via mutable_update(). It would also be helpful to have a unit test for this scenario. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
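The bug pattern and the proposed fix, reduced to a standalone sketch (plain structs standing in for the generated protobuf accessors):

```cpp
#include <cassert>

// Minimal stand-ins for the generated protobuf types.
struct TaskStatus { int source = 0; };

struct StatusUpdate {
  TaskStatus status_;
  const TaskStatus& status() const { return status_; }
  TaskStatus* mutable_status() { return &status_; }
};

// Buggy pattern from Slave::statusUpdate: the local copy is mutated and
// then discarded; the StatusUpdate that gets forwarded is untouched.
void setSourceBuggy(StatusUpdate& update, int source) {
  TaskStatus status = update.status();  // copies!
  status.source = source;               // change is lost
}

// The fix proposed in the ticket: mutate the status owned by the update.
void setSourceFixed(StatusUpdate& update, int source) {
  update.mutable_status()->source = source;
}
```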
[jira] [Updated] (MESOS-4812) Mesos fails to escape command health checks
[ https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-4812: -- Labels: health-check mesosphere (was: health-check) > Mesos fails to escape command health checks > --- > > Key: MESOS-4812 > URL: https://issues.apache.org/jira/browse/MESOS-4812 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Lukas Loesche >Assignee: haosdent > Labels: health-check, mesosphere > Attachments: health_task.gif > > > As described in https://github.com/mesosphere/marathon/issues/ > I would like to run a command health check > {noformat} > /bin/bash -c " {noformat} > The health check fails because Mesos, while running the command inside double > quotes of a sh -c "" doesn't escape the double quotes in the command. > If I escape the double quotes myself the command health check succeeds. But > this would mean that the user needs intimate knowledge of how Mesos executes > his commands which can't be right. > I was told this is not a Marathon but a Mesos issue so am opening this JIRA. > I don't know if this only affects the command health check. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks
[ https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858038#comment-15858038 ] James DeFelice commented on MESOS-4812: --- would like to see some traction on this. several issues have been reported against Marathon, the latest of which is https://github.com/mesosphere/marathon/issues/5136 > Mesos fails to escape command health checks > --- > > Key: MESOS-4812 > URL: https://issues.apache.org/jira/browse/MESOS-4812 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Lukas Loesche >Assignee: haosdent > Labels: health-check > Attachments: health_task.gif -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-6951) Docker containerizer: mangled environment when env value contains LF byte
[ https://issues.apache.org/jira/browse/MESOS-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-6951: -- Labels: mesosphere (was: ) > Docker containerizer: mangled environment when env value contains LF byte > - > > Key: MESOS-6951 > URL: https://issues.apache.org/jira/browse/MESOS-6951 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Jan-Philip Gehrcke > Labels: mesosphere > > Consider this Marathon app definition: > {code} > { > "id": "/testapp", > "cmd": "env && tail -f /dev/null", > "env":{ > "TESTVAR":"line1\nline2" > }, > "cpus": 0.1, > "mem": 10, > "instances": 1, > "container": { > "type": "DOCKER", > "docker": { > "image": "alpine" > } > } > } > {code} > The JSON-encoded newline in the value of the {{TESTVAR}} environment variable > leads to a corrupted task environment. What follows is a subset of the > resulting task environment (as printed via {{env}}, i.e. in key=value > notation): > {code} > line2= > TESTVAR=line1 > {code} > That is, the trailing part of the intended value ended up being interpreted > as variable name, and only the leading part of the intended value was used as > actual value for {{TESTVAR}}. > Common application scenarios that would badly break with that involve > pretty-printed JSON documents or YAML documents passed along via the > environment. > Following the code and information flow led to the conclusion that Docker's > {{--env-file}} command line interface is the weak point in the flow. 
It is > currently used in Mesos' Docker containerizer for passing the environment to > the container: > {code} > argv.push_back("--env-file"); > argv.push_back(environmentFile); > {code} > (Ref: > [code|https://github.com/apache/mesos/blob/c0aee8cc10b1d1f4b2db5ff12b771372fdd5b1f3/src/docker/docker.cpp#L584]) > Docker's {{--env-file}} argument behavior is documented via > {quote} > The --env-file flag takes a filename as an argument > and expects each line to be in the VAR=VAL format, > {quote} > (Ref: https://docs.docker.com/engine/reference/commandline/run/) > That is, Docker identifies individual environment variable key/value pair > definitions based on newline bytes in that file which explains the observed > environment variable value fragmentation. Notably, Docker does not provide a > mechanism for escaping newline bytes in the values specified in this > environment file. > I think it is important to understand that Docker's {{--env-file}} mechanism > is ill-posed in the sense that it is not capable of transmitting the whole > range of environment variable values allowed by POSIX. That's what the Single > UNIX Specification, Version 3 has to say about environment variable values: > {quote} > the value shall be composed of characters from the > portable character set (except NUL and as indicated below). > {quote} > (Ref: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html) > About "The portable character set": > http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3 > It includes (among others) the LF byte. 
Understandably, the current Docker > {{--env-file}} behavior will not change, so this is not an issue that can be > deferred to Docker: https://github.com/docker/docker/issues/12997 > Notably, the {{--env-file}} method for communicating environment variables to > Docker containers was just recently introduced to Mesos as of > https://issues.apache.org/jira/browse/MESOS-6566, so as to avoid leaking secrets > through the process listing. Previously, we specified env key/value pairs on > the command line, which leaked secrets to the process list and probably also > did not support the full range of valid environment variable values. > We need a solution that > 1) does not leak sensitive values (i.e. is compliant with MESOS-6566). > 2) allows for passing arbitrary environment variable values. > It seems that Docker's {{--env}} method can be used for that. It can be used > to define _just the names of the environment variables_ to-be-passed-along, > in which case the docker binary will read the corresponding values from its > own environment, which we can prepare appropriately when we invoke > the corresponding child process. This method would still leak environment > variable _names_ to the process listing, but (especially if documented) this > should be fine. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6951) Docker containerizer: mangled environment when env value contains LF byte
[ https://issues.apache.org/jira/browse/MESOS-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864613#comment-15864613 ] James DeFelice commented on MESOS-6951: --- xref https://reviews.apache.org/r/53877/#comment237227 It's actually a problem for more than just LFs. SPACE and TAB characters also generate errors for the docker env-file parser. In addition, docker uses special passthrough functionality for envvars in the env-file that have blank values. > Docker containerizer: mangled environment when env value contains LF byte > - > > Key: MESOS-6951 > URL: https://issues.apache.org/jira/browse/MESOS-6951 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Jan-Philip Gehrcke > Labels: mesosphere --
[jira] [Comment Edited] (MESOS-6951) Docker containerizer: mangled environment when env value contains LF byte
[ https://issues.apache.org/jira/browse/MESOS-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864613#comment-15864613 ] James DeFelice edited comment on MESOS-6951 at 2/13/17 10:45 PM: - xref https://reviews.apache.org/r/53877/#comment237227 It's actually a problem for more than just LFs. SPACE and TAB characters also generate errors for the docker env-file parser. In addition, docker uses special passthrough functionality for envvars in the env-file that have blank values. see https://github.com/docker/docker/blob/d5fe259e121b3c86d1de0dae1760aafb48507ea9/runconfig/opts/envfile.go#L26 was (Author: jdef): xref https://reviews.apache.org/r/53877/#comment237227 It's actually a problem for more than just LFs. SPACE and TAB characters also generate errors for the docker env-file parser. In addition, docker uses special passthrough functionality for envvars in the env-file that have blank values. > Docker containerizer: mangled environment when env value contains LF byte > - > > Key: MESOS-6951 > URL: https://issues.apache.org/jira/browse/MESOS-6951 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Jan-Philip Gehrcke > Labels: mesosphere
[jira] [Updated] (MESOS-1971) Switch cgroups_limit_swap default to true
[ https://issues.apache.org/jira/browse/MESOS-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-1971: -- Labels: mesosphere (was: ) > Switch cgroups_limit_swap default to true > - > > Key: MESOS-1971 > URL: https://issues.apache.org/jira/browse/MESOS-1971 > Project: Mesos > Issue Type: Improvement >Reporter: Anton Lindström >Priority: Minor > Labels: mesosphere > > Switch cgroups_limit_swap to true by default; see MESOS-1662 for more > information. > Thanks! -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7243) CNI implementation assumptions don't align with NetworkInfo proto docs
James DeFelice created MESOS-7243: - Summary: CNI implementation assumptions don't align with NetworkInfo proto docs Key: MESOS-7243 URL: https://issues.apache.org/jira/browse/MESOS-7243 Project: Mesos Issue Type: Bug Reporter: James DeFelice The protobuf docs for NetworkInfo state that frameworks may request one or more IP addresses via the `ip_addresses` field: the actual `ip_address` and `protocol` may be left blank; one entry is required for each IP address requested at task-launch time. The CNI implementation doesn't check `ip_addresses` and provides one address by default. This behavior conflicts with the docs. It's been suggested that it is "legal" for the CNI implementation to assume that if `ip_addresses` was completely empty, that would translate to `ip_addresses: [ {} ]` (requesting a single IP address). I've argued against this logic: by assuming such a default, it becomes impossible for a container to express interest in joining a network (namespace) without actually being allocated an IP address. This might be an edge case, but it's one that's ruled out as soon as we assume empty-collection == give-me-one-address-please. FWIW the Marathon API has made this (empty = give me a default) mistake several times and it has burned us. Strongly urge caution here. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7326) ship a sample CNI config that provides a mesos-bridge network, analogous to the default bridge that ships with Docker
James DeFelice created MESOS-7326: - Summary: ship a sample CNI config that provides a mesos-bridge network, analogous to the default bridge that ships with Docker Key: MESOS-7326 URL: https://issues.apache.org/jira/browse/MESOS-7326 Project: Mesos Issue Type: Task Components: documentation, network Reporter: James DeFelice UCR supports port mapping, and Marathon now ships an API that lets users specify a `container/bridge` networking mode. By default, this maps to a CNI network called `mesos-bridge`. Mesos should, at least: # ship a sample CNI configuration file that requires minimal edits, if any, to enable a `mesos-bridge` CNI network for a vanilla Mesos install; like Docker, the default bridge defaults to a "host-private" mode of operation # clearly document the steps required to enable this default bridge network for simple Mesos clusters /cc [~jieyu] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
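For illustration, such a `mesos-bridge` config might look like the following. The bridge name and subnet are placeholders; the field set follows the standard CNI bridge/host-local plugin conventions:

```json
{
  "name": "mesos-bridge",
  "type": "bridge",
  "bridge": "mesos-cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "172.31.254.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```

With `ipMasq` enabled and a host-local subnet, this behaves like Docker's default bridge: containers are host-private but can reach out via masquerading.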
[jira] [Created] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability
James DeFelice created MESOS-7375: - Summary: provide additional insight for framework developers re: GPU_RESOURCES capability Key: MESOS-7375 URL: https://issues.apache.org/jira/browse/MESOS-7375 Project: Mesos Issue Type: Documentation Reporter: James DeFelice On clusters where all nodes are equal and every node has a GPU, frameworks that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. This is surprising for operators. Even when a framework doesn't **need** GPU resources, it may make sense for a framework scheduler to provide a `--enable-gpu-compat` (or similar) flag that results in the framework advertising the `GPU_RESOURCES` capability even though it does not intend to consume any GPU. The effect being that said framework will now receive offers on clusters where all nodes have GPU resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability
[ https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-7375: -- Description: On clusters where all nodes are equal and every node has a GPU, frameworks that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. This is surprising for operators. Even when a framework doesn't **need** GPU resources, it may make sense for a framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag that results in the framework advertising the `GPU_RESOURCES` capability even though it does not intend to consume any GPU. The effect being that said framework will now receive offers on clusters where all nodes have GPU resources. was: On clusters where all nodes are equal and every node has a GPU, frameworks that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. This is surprising for operators. Even when a framework doesn't **need** GPU resources, it may make sense for a framework scheduler to provide a `--enable-gpu-compat` (or similar) flag that results in the framework advertising the `GPU_RESOURCES` capability even though it does not intend to consume any GPU. The effect being that said framework will now receive offers on clusters where all nodes have GPU resources. > provide additional insight for framework developers re: GPU_RESOURCES > capability > > > Key: MESOS-7375 > URL: https://issues.apache.org/jira/browse/MESOS-7375 > Project: Mesos > Issue Type: Documentation >Reporter: James DeFelice > Labels: mesosphere > > On clusters where all nodes are equal and every node has a GPU, frameworks > that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. > This is surprising for operators. 
> Even when a framework doesn't **need** GPU resources, it may make sense for a > framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag > that results in the framework advertising the `GPU_RESOURCES` capability even > though it does not intend to consume any GPU. The effect being that said > framework will now receive offers on clusters where all nodes have GPU > resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7411) Frameworks may specify alternate "stop" signal vs. SIGTERM
James DeFelice created MESOS-7411: - Summary: Frameworks may specify alternate "stop" signal vs. SIGTERM Key: MESOS-7411 URL: https://issues.apache.org/jira/browse/MESOS-7411 Project: Mesos Issue Type: Improvement Reporter: James DeFelice Normally Mesos sends a {{SIGTERM}} that escalates to {{SIGKILL}} when stopping a running process. Some apps handle {{SIGTERM}} differently and expose "stop" behavior via an alternate signal, for example {{SIGRTMIN+3}} (looking at you, systemd). It should be possible for a framework to specify an alternate "stop" signal via the Mesos API, which Mesos would then send to a process when attempting to stop it (before escalating to {{SIGKILL}}). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7411) Frameworks may specify alternate "stop" signal vs. SIGTERM
[ https://issues.apache.org/jira/browse/MESOS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979231#comment-15979231 ] James DeFelice commented on MESOS-7411: --- Related Docker feature: https://docs.docker.com/engine/reference/commandline/run/#stop-container-with-signal---stop-signal > Frameworks may specify alternate "stop" signal vs. SIGTERM > -- > > Key: MESOS-7411 > URL: https://issues.apache.org/jira/browse/MESOS-7411 > Project: Mesos > Issue Type: Improvement >Reporter: James DeFelice > Labels: mesosphere > > Normally Mesos sends a {{SIGTERM}} that escalates to {{SIGKILL}} when > stopping a running process. Some apps handle {{SIGTERM}} differently and > expose "stop" behavior via an alternate signal, for example {{SIGRTMIN+3}} > (looking at you, systemd). It should be possible for a framework to specify > an alternate "stop" signal via the Mesos API, which Mesos would then send to > a process when attempting to stop it (before escalating to {{SIGKILL}}). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (MESOS-7271) JNI SIGSEGV failed when connecting spark to mesos master
[ https://issues.apache.org/jira/browse/MESOS-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015769#comment-16015769 ] James DeFelice edited comment on MESOS-7271 at 5/18/17 1:46 PM: [~mgummelt] the stack trace is from an OpenJDK platform; is that what you're testing with? was (Author: jdef): [~mgummelt]the stack trace is from an OpenJDK platform; is that what you're testing with? > JNI SIGSEGV failed when connecting spark to mesos master > > > Key: MESOS-7271 > URL: https://issues.apache.org/jira/browse/MESOS-7271 > Project: Mesos > Issue Type: Bug > Components: java api >Affects Versions: 1.1.0, 1.2.0 > Environment: Ubuntu 16.04, OpenJDK 8, Spark 2.1.1 >Reporter: Qi Cui > > Run starting. Expected test count is: 1 > SampleDataFrameTest: > 17/03/20 11:53:16 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0320 11:53:19.775842 4679 process.cpp:1071] libprocess is initialized on > 192.168.0.99:38293 with 8 worker threads > I0320 11:53:19.775975 4679 logging.cpp:199] Logging to STDERR > I0320 11:53:19.789871 4725 sched.cpp:226] Version: 1.1.0 > I0320 11:53:19.832826 4717 sched.cpp:330] New master detected at > master@192.168.0.50:5050 > I0320 11:53:19.838253 4717 sched.cpp:341] No credentials provided. 
> Attempting to register without authentication > I0320 11:53:19.838337 4717 sched.cpp:820] Sending SUBSCRIBE call to > master@192.168.0.50:5050 > I0320 11:53:19.840265 4717 sched.cpp:853] Will retry registration in > 32.354951ms if necessary > I0320 11:53:19.844734 4717 sched.cpp:743] Framework registered with > 6e147824-5d88-411b-9c09-a7137565c309-0001 > I0320 11:53:19.864850 4717 sched.cpp:757] Scheduler::registered took > 20.022604ms > ERROR: exception pending on entry to FindMesosClass() > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7ffa06fea4a6, pid=4677, tid=0x7ff9a1a46700 > # > # JRE version: OpenJDK Runtime Environment (8.0_121-b13) (build > 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > # Java VM: OpenJDK 64-Bit Server VM (25.121-b13 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # V [libjvm.so+0x6744a6] > # > # Failed to write core dump. Core dumps have been disabled. To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # An error report file with more information is saved as: > # /media/sf_G_DRIVE/src/spark-testing-base/hs_err_pid4677.log > # > # If you would like to submit a bug report, please visit: > # http://bugreport.java.com/bugreport/crash.jsp > # -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7271) JNI SIGSEGV failed when connecting spark to mesos master
[ https://issues.apache.org/jira/browse/MESOS-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015769#comment-16015769 ] James DeFelice commented on MESOS-7271: --- [~mgummelt] the stack trace is from an OpenJDK platform; is that what you're testing with? > JNI SIGSEGV failed when connecting spark to mesos master > > > Key: MESOS-7271 > URL: https://issues.apache.org/jira/browse/MESOS-7271 > Project: Mesos > Issue Type: Bug > Components: java api >Affects Versions: 1.1.0, 1.2.0 > Environment: Ubuntu 16.04, OpenJDK 8, Spark 2.1.1 >Reporter: Qi Cui -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6512) Add support for cgroups hugetlb subsystem
[ https://issues.apache.org/jira/browse/MESOS-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015841#comment-16015841 ] James DeFelice commented on MESOS-6512: --- basic isolator support for this seems to have landed: https://github.com/apache/mesos/commit/b495fda02566ee6e47ac5a618a6b14ff556e0d76 > Add support for cgroups hugetlb subsystem > - > > Key: MESOS-6512 > URL: https://issues.apache.org/jira/browse/MESOS-6512 > Project: Mesos > Issue Type: Task >Reporter: haosdent >Assignee: haosdent > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6512) Add support for cgroups hugetlb subsystem
[ https://issues.apache.org/jira/browse/MESOS-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015844#comment-16015844 ] James DeFelice commented on MESOS-6512: --- k8s is working on something similar https://github.com/kubernetes/kubernetes/pull/44817 > Add support for cgroups hugetlb subsystem > - > > Key: MESOS-6512 > URL: https://issues.apache.org/jira/browse/MESOS-6512 > Project: Mesos > Issue Type: Task >Reporter: haosdent >Assignee: haosdent > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7523) Whitelist devices in bulk on a per-container basis
James DeFelice created MESOS-7523: - Summary: Whitelist devices in bulk on a per-container basis Key: MESOS-7523 URL: https://issues.apache.org/jira/browse/MESOS-7523 Project: Mesos Issue Type: Bug Reporter: James DeFelice Continuation of the work in MESOS-6791 It should be possible to whitelist a range (R) of devices such that R may be exposed to a container launched by an agent. Not all containers should have access to R by default, only those containers whose ContainerInfo specifies such access. For example, it may be useful to whitelist the range of devices matching the glob expressions `/dev/{s,h,xv}d[a-z]*` and `/dev/dm-*` and `/dev/mapper/*` for a container that intends to manage storage devices. /cc [~jieyu] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7523) Whitelist devices in bulk on a per-container basis
[ https://issues.apache.org/jira/browse/MESOS-7523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-7523: -- Description: Continuation of the work in MESOS-6791 It should be possible to whitelist a range (R) of devices such that R may be exposed to a container launched by an agent. Not all containers should have access to R by default, only those containers whose ContainerInfo specifies such access. For example, it may be useful to whitelist the range of devices matching the glob expressions `/dev/\{s,h,xv}d\[a-z]*` and `/dev/dm-\*` and `/dev/mapper/\*` for a container that intends to manage storage devices. /cc [~jieyu] was: Continuation of the work in MESOS-6791 It should be possible to whitelist a range (R) of devices such that R may be exposed to a container launched by an agent. Not all containers should have access to R by default, only those containers whose ContainerInfo specifies such access. For example, it may be useful to whitelist the range of devices matching the glob expressions `/dev/{s,h,xv}d[a-z]*` and `/dev/dm-*` and `/dev/mapper/*` for a container that intends to manage storage devices. /cc [~jieyu] > Whitelist devices in bulk on a per-container basis > -- > > Key: MESOS-7523 > URL: https://issues.apache.org/jira/browse/MESOS-7523 > Project: Mesos > Issue Type: Bug >Reporter: James DeFelice > Labels: mesosphere > > Continuation of the work in MESOS-6791 > It should be possible to whitelist a range (R) of devices such that R may be > exposed to a container launched by an agent. Not all containers should have > access to R by default, only those containers whose ContainerInfo specifies > such access. > For example, it may be useful to whitelist the range of devices matching the > glob expressions `/dev/\{s,h,xv}d\[a-z]*` and `/dev/dm-\*` and > `/dev/mapper/\*` for a container that intends to manage storage devices. > /cc [~jieyu] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-5388) MesosContainerizerLaunch flags execute arbitrary commands via shell
James DeFelice created MESOS-5388: - Summary: MesosContainerizerLaunch flags execute arbitrary commands via shell Key: MESOS-5388 URL: https://issues.apache.org/jira/browse/MESOS-5388 Project: Mesos Issue Type: Bug Reporter: James DeFelice For example, the docker volume isolator's containerPath is appended (without sanitization) to a command that's executed in this manner. As such, it's possible to inject arbitrary shell commands to be executed by mesos. https://github.com/apache/mesos/blob/17260204c833c643adf3d8f36ad8a1a606ece809/src/slave/containerizer/mesos/launch.cpp#L206 Perhaps instead of strings these commands could/should be sent as string arrays that could be passed as argv arguments w/o shell interpretation? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
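The argv-array suggestion in MESOS-5388 can be illustrated with a short sketch (Python here for brevity; the real fix would live in the C++ launcher, and the `mkdir` command is only a stand-in for whatever the isolator actually runs):

```python
# Unsafe pattern: interpolating an unsanitized containerPath into a shell
# string. A value like "vol; touch /tmp/pwned" smuggles in a second command.
def mount_via_shell(container_path):
    return "mkdir -p " + container_path  # later run with shell=True

# Safer pattern: build an argv array so no shell ever interprets the value;
# the whole string stays a single literal argument to execve.
def mount_via_argv(container_path):
    return ["mkdir", "-p", container_path]

malicious = "vol; touch /tmp/pwned"
```

With the argv form, `subprocess.run(mount_via_argv(malicious))` would try to create a directory literally named `vol; touch /tmp/pwned` instead of executing the injected command.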
[jira] [Created] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox
James DeFelice created MESOS-5389: - Summary: docker containerizer should prefix relative volume.container_path values with the path to the sandbox Key: MESOS-5389 URL: https://issues.apache.org/jira/browse/MESOS-5389 Project: Mesos Issue Type: Bug Reporter: James DeFelice docker containerizer currently requires absolute paths for values of volume.container_path. this is inconsistent with the mesos containerizer which requires relative container_path. it makes for a confusing API. both at the Mesos level as well as at the Marathon level. ideally the docker containerizer would allow a framework to specify a relative path for volume.container_path and in such cases automatically convert it to an absolute path by prepending the sandbox directory to it. /cc [~jieyu] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
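The behavior proposed in MESOS-5389 amounts to a one-line path join before the docker containerizer consumes the volume. A minimal sketch (Python; the sandbox path shown is made up for illustration):

```python
import os.path

def resolve_container_path(container_path, sandbox_dir):
    """Return container_path unchanged if it is already absolute; otherwise
    anchor it in the task sandbox, mirroring the mesos containerizer's
    relative-path convention."""
    if os.path.isabs(container_path):
        return container_path
    return os.path.join(sandbox_dir, container_path)

# Hypothetical sandbox directory chosen by the agent:
sandbox = "/var/lib/mesos/slaves/S1/frameworks/F1/executors/E1/runs/latest"
```

A framework could then pass `"mysqldata"` and have it resolve inside the sandbox, while `/var/lib/mysql` would pass through untouched.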
[jira] [Commented] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox
[ https://issues.apache.org/jira/browse/MESOS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286383#comment-15286383 ] James DeFelice commented on MESOS-5389: --- Disagree. The containerizers should both be able to accept relative containerPaths, otherwise the APIs become confusing for the end user. The user doesn't always know the path to the sandbox that Mesos (or dc/os) will append so asking the user to specify an absolute containerPath (if they want the vol mounted in their sandbox) is a non-starter. -- James DeFelice 585.241.9488 (voice) 650.649.6071 (fax) > docker containerizer should prefix relative volume.container_path values with > the path to the sandbox > - > > Key: MESOS-5389 > URL: https://issues.apache.org/jira/browse/MESOS-5389 > Project: Mesos > Issue Type: Bug >Reporter: James DeFelice > Labels: docker, mesosphere, storage, volumes > > docker containerizer currently requires absolute paths for values of > volume.container_path. this is inconsistent with the mesos containerizer > which requires relative container_path. it makes for a confusing API. both at > the Mesos level as well as at the Marathon level. > ideally the docker containerizer would allow a framework to specify a > relative path for volume.container_path and in such cases automatically > convert it to an absolute path by prepending the sandbox directory to it. > /cc [~jieyu] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox
[ https://issues.apache.org/jira/browse/MESOS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286386#comment-15286386 ] James DeFelice commented on MESOS-5389: --- s/will append/will use/ > docker containerizer should prefix relative volume.container_path values with > the path to the sandbox > - > > Key: MESOS-5389 > URL: https://issues.apache.org/jira/browse/MESOS-5389 > Project: Mesos > Issue Type: Bug >Reporter: James DeFelice > Labels: docker, mesosphere, storage, volumes > > docker containerizer currently requires absolute paths for values of > volume.container_path. this is inconsistent with the mesos containerizer > which requires relative container_path. it makes for a confusing API. both at > the Mesos level as well as at the Marathon level. > ideally the docker containerizer would allow a framework to specify a > relative path for volume.container_path and in such cases automatically > convert it to an absolute path by prepending the sandbox directory to it. > /cc [~jieyu] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5537) http v1 SUBSCRIBED scheduler event always has nil http_interval_seconds
James DeFelice created MESOS-5537: - Summary: http v1 SUBSCRIBED scheduler event always has nil http_interval_seconds Key: MESOS-5537 URL: https://issues.apache.org/jira/browse/MESOS-5537 Project: Mesos Issue Type: Bug Reporter: James DeFelice Fix For: 1.0.0 I'm writing a controller in Go to monitor heartbeats. I'd like to use the interval as communicated by the master, which should be specified in the SUBSCRIBED event. But it's not. {code} 2016/06/03 18:34:04 {Type:SUBSCRIBED Subscribed:&Event_Subscribed{FrameworkID:&mesos.FrameworkID{Value:ffdb6d6e-0167-4fa2-98f9-2c3f8157fc25-0004,},HeartbeatIntervalSeconds:nil,} Offers:nil Rescind:nil Update:nil Message:nil Failure:nil Error:nil} {code} {code} $ dpkg -l |grep -e mesos ii mesos 0.28.0-2.0.16.ubuntu1404 amd64 Cluster resource manager with efficient resource isolation {code} I *am* seeing HEARTBEAT events. Just not seeing the interval specified in the SUBSCRIBED event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
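Until the SUBSCRIBED event reliably carries the interval, a client has to pick a fallback. A sketch of the defaulting logic (Python; the 15-second fallback mirrors the master's default heartbeat interval, but treat the constant as an assumption):

```python
DEFAULT_HEARTBEAT_INTERVAL = 15.0  # seconds; assumed fallback value

def heartbeat_interval(subscribed):
    """subscribed: a dict decoded from the SUBSCRIBED event. As reported in
    MESOS-5537, heartbeat_interval_seconds may be missing (nil), so fall
    back to a default rather than treating absence as 'no heartbeats'."""
    value = subscribed.get("heartbeat_interval_seconds")
    return value if value is not None else DEFAULT_HEARTBEAT_INTERVAL
```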
[jira] [Created] (MESOS-5853) http v1 API should document behavior regarding generated content-type header in the presence of errors
James DeFelice created MESOS-5853: - Summary: http v1 API should document behavior regarding generated content-type header in the presence of errors Key: MESOS-5853 URL: https://issues.apache.org/jira/browse/MESOS-5853 Project: Mesos Issue Type: Bug Components: documentation Reporter: James DeFelice Changes made as part of https://issues.apache.org/jira/browse/MESOS-3739 set a default Content-Type header. This should be documented in the Mesos v1 HTTP API literature so that devs implementing against the spec know what to expect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5899) docker containerizer should log a warning when docker_volume.driver_options are specified
James DeFelice created MESOS-5899: - Summary: docker containerizer should log a warning when docker_volume.driver_options are specified Key: MESOS-5899 URL: https://issues.apache.org/jira/browse/MESOS-5899 Project: Mesos Issue Type: Improvement Components: docker Reporter: James DeFelice currently the docker containerizer ignores the values of docker_volume.driver_options which could be confusing to framework devs trying to use docker volume plugins with the docker containerizer. the docker containerizer should probably log a warning that it's ignoring driver_options if they're present in the docker_volume protobuf. it might also be a good idea to document this limitation in the protobuf (.proto) files themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-8237) Strip Resource.allocation_info for non-MULTI_ROLE schedulers.
[ https://issues.apache.org/jira/browse/MESOS-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273358#comment-16273358 ] James DeFelice commented on MESOS-8237: --- [~bmahler] I don't foresee an issue w/ Offer.allocation_info related to mesos-go. I can't speak for other Mesos library maintainers. > Strip Resource.allocation_info for non-MULTI_ROLE schedulers. > - > > Key: MESOS-8237 > URL: https://issues.apache.org/jira/browse/MESOS-8237 > Project: Mesos > Issue Type: Bug >Reporter: James DeFelice >Assignee: Benjamin Mahler > Labels: mesosphere > > In support of MULTI_ROLE capable frameworks, a Resource.allocation_info field > was added and the Resource math of the Mesos library was updated to check for > matching allocation_info when checking for (in)equality, addability, > subtractability, containment, etc. To compensate for these changes, the demo > frameworks of Mesos were updated to set the allocation_info for Resource > objects during the "matching phase" in which offers' resources are evaluated > in order for the framework to launch tasks. The Mesos demo frameworks NEEDED > to be updated because the Resource algebra within Mesos now depended on > matching allocation_info fields of Resource objects when executing algebraic > operations. See > https://github.com/apache/mesos/commit/c20744a9976b5e83698e9c6062218abb4d2e6b25#diff-298cc6a77862b7ff3422cd06c215ef28R91 > . > This poses a unique problem for **external** libraries that both aim to > support various frameworks, some that DO and some that DO NOT opt-in to the > MULTI_ROLE capability; specifically those external libraries that implement > Resource algebra that's consistent with what Mesos implements internally. One > such example of a library is mesos-go, though there are undoubtedly others. 
> The problem can be explained via this scenario: > {quote} > Flo's mesos-go framework is running well, it doesn't opt-in to MULTI_ROLE > because it doesn't need multiple roles. His framework runs on a version of > Mesos that existed prior to integration of MULTI_ROLE support. His DC > operator upgrades the mesos cluster to the latest version. Flo rebuilds his > framework on the latest version of mesos-go and re-launches it on the > cluster. He observes that his framework receives offers, but rejects ALL of > them. Digging into the code he realizes that Mesos is injecting > allocation_info into Resource objects being offered to his framework, and > mesos-go considers allocation_info when comparing Resource objects (because > it's MULTI_ROLE compatible now), but his framework doesn't take this into > consideration when preparing its own Resource objects prior to the "resource > matching phase". The consequence is that Flo's framework is trying to match > against Resources that will never align because his framework isn't setting > an allocation_info that might possibly match the allocation_info that Mesos > is always injecting - regardless of the MULTI_ROLE capability (or lack > thereof in this case) of his framework. > {quote} > If Mesos were to strip the allocation_info from Resource objects, prior to > offering them to non-multi-role frameworks, then the problem illustrated > above would go away. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
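The fix proposed in MESOS-8237 reduces to a strip step applied to offers bound for non-MULTI_ROLE schedulers. Resources are modeled as plain dicts here purely for illustration, not the actual protobufs:

```python
def strip_allocation_info(resources):
    """Remove allocation_info so a non-MULTI_ROLE framework's own Resource
    objects compare equal to the resources the master offers it."""
    return [{k: v for k, v in r.items() if k != "allocation_info"}
            for r in resources]

offered = [{"name": "cpus", "scalar": 2.0, "allocation_info": {"role": "*"}}]
wanted  = [{"name": "cpus", "scalar": 2.0}]

# Without stripping, the matching phase in Flo's scenario fails:
assert offered != wanted
# After stripping, the same resources match:
assert strip_allocation_info(offered) == wanted
```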
[jira] [Commented] (MESOS-8244) Add operator API to reload local resource providers.
[ https://issues.apache.org/jira/browse/MESOS-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285458#comment-16285458 ] James DeFelice commented on MESOS-8244: --- It looks to have landed here: https://reviews.apache.org/r/63901/ but there's a bug in the API (I left a comment on the review) > Add operator API to reload local resource providers. > > > Key: MESOS-8244 > URL: https://issues.apache.org/jira/browse/MESOS-8244 > Project: Mesos > Issue Type: Task >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: mesosphere, storage > > To add, remove and update local resource providers on the fly more > conveniently and without restarting agents, we would like to introduce new > operator API to add new config files in the resource provider config > directory and trigger a reload for the resource provider. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8342) Make the Docker containerizer exhibit the same behavior as Mesos/UCR which sets `memory.memsw.limit_in_bytes` to be equal to `memory.limit_in_bytes` when `MESOS_CGROUPS_L
[ https://issues.apache.org/jira/browse/MESOS-8342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James DeFelice updated MESOS-8342: -- Labels: mesosphere (was: ) > Make the Docker containerizer exhibit the same behavior as Mesos/UCR which > sets `memory.memsw.limit_in_bytes` to be equal to `memory.limit_in_bytes` > when `MESOS_CGROUPS_LIMIT_SWAP=true` > - > > Key: MESOS-8342 > URL: https://issues.apache.org/jira/browse/MESOS-8342 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Affects Versions: 1.4.1 >Reporter: Vishnu Mohan > Labels: mesosphere > > Please add support for the functionality afforded by the {{\--memory-swap}} > and {{\--memory-swappiness}} {{docker run}} options to the Docker > Containerizer: > https://github.com/apache/mesos/blob/1.4.x/src/docker/docker.hpp#L193-L194 > ATM the Docker containerizer does not honor > {{MESOS_CGROUPS_LIMIT_SWAP=true}}, and depending on the OS {{swappiness}} > configuration, the Docker Engine will (typically) set > {{memory.memsw.limit_in_bytes}} to twice the value of > {{memory.limit_in_bytes}} > This means that all Docker containers can/will swap up to 2x their allocation > before being OOM-killed by the Docker Engine which can cause a huge > performance problem. > The only real workaround, for now, is to ensure that all apps that are > launched by the Docker containerizer (at least those that are launched via > Marathon) also pass {{\--memory-swap=}} and/or pass > {{\--memory-swappiness=0}}, depending on the version of the Docker Engine, > ({{docker run --help}}), as arbitrary Docker params, assuming the scheduler > supports it, which is operationally cumbersome. 
> Ideally, the Docker containerizer would exhibit the same behavior as > Mesos/UCR which sets {{memory.memsw.limit_in_bytes}} to be equal to > {{memory.limit_in_bytes}} when {{MESOS_CGROUPS_LIMIT_SWAP=true}} > Ref: > https://docs.docker.com/engine/admin/resource_constraints/#prevent-a-container-from-using-swap -- This message was sent by Atlassian JIRA (v6.4.14#64029)
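The requested behavior reduces to writing the same byte value into both cgroup knobs. A sketch (Python; the keys are the standard cgroup-v1 memory controller file names, but the code writes into a dict rather than the real filesystem):

```python
def limit_swap(cgroup, mem_limit_bytes):
    """Mimic the UCR behavior under MESOS_CGROUPS_LIMIT_SWAP=true: cap
    memory+swap at exactly the memory limit, so the container cannot swap
    past its allocation (instead of Docker's typical 2x)."""
    cgroup["memory.limit_in_bytes"] = mem_limit_bytes
    cgroup["memory.memsw.limit_in_bytes"] = mem_limit_bytes  # equal, not 2x
    return cgroup
```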
[jira] [Created] (MESOS-8693) agent: update_resource_provider w/ identical RP info should not always force-restart plugin
James DeFelice created MESOS-8693: - Summary: agent: update_resource_provider w/ identical RP info should not always force-restart plugin Key: MESOS-8693 URL: https://issues.apache.org/jira/browse/MESOS-8693 Project: Mesos Issue Type: Task Affects Versions: 1.5.0 Reporter: James DeFelice Currently when the UPDATE_RESOURCE_PROVIDER call is sent to an agent, and the RP info of the request is identical to that of the running configuration, the agent force-restarts the related CSI plugin. This is surprising on two accounts: First, because it increases the complexity of the client that wants to ensure the latest RP configuration is pushed to the agent. A CSI plugin may take a long time to become ready after being reconfigured. It's likely that a caller will experience a timeout while waiting for the RP to come into a healthy state w/ the desired configuration. Upon retrying the update, a client DOES NOT always wish to restart an ongoing reconfiguration effort – especially for long-running reconfiguration operations. Mesos should NOT restart the related CSI plugin by default if the new RP info matches the existing one, and instead should either return 409 or some other, more appropriate error code (409 would be nice/consistent, see below). Second, because it differs from the idempotent nature of the ADD_RESOURCE_PROVIDER call, which does NOT change the state of the plugin in case of a duplicate request. The ADD_RESOURCE_PROVIDER call returns a 409 response, which allows callers to simply re-issue redundant requests without concern for interrupting the state of a running plugin. In the event that the caller DOES want to force the restart of an underlying CSI plugin, I suggest that we extend the UPDATE_RESOURCE_PROVIDER call w/ a "force_restart" field (sibling to the "info" field). "force_restart == true" would only have meaning for updates that involve unchanged RP info, otherwise it would go unused. 
/cc [~jieyu] [~chhsia0] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
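The semantics suggested in MESOS-8693 can be sketched as a guard in the agent's update handler. Note that `force_restart` is the field proposed by this ticket, not an existing API, and the return values are illustrative labels:

```python
def handle_update(current_info, new_info, force_restart=False):
    """Decide what the agent should do for UPDATE_RESOURCE_PROVIDER.
    Matching RP info is treated as a no-op conflict, mirroring
    ADD_RESOURCE_PROVIDER's 409 on duplicates, unless the caller
    explicitly opts into restarting the CSI plugin."""
    if new_info == current_info:
        return "restart-plugin" if force_restart else "conflict-409"
    return "reconfigure-plugin"
```

This keeps redundant retries safe: a client that times out and re-sends the identical config gets a 409 instead of interrupting an in-flight reconfiguration.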
[jira] [Commented] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
[ https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415941#comment-16415941 ] James DeFelice commented on MESOS-7697: --- https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674 > Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions > > > Key: MESOS-7697 > URL: https://issues.apache.org/jira/browse/MESOS-7697 > Project: Mesos > Issue Type: Bug > Components: HTTP API, libprocess >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > Returning a 404 error for a condition that's a known temporary condition is > confusing from a client's perspective. A client wants to know how to recover > from various error conditions. A 404 error condition should be distinct from > a "server is not yet ready, but will be shortly" condition (which should > probably be reported as a 503 "unavailable" error). > https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593 > {code} > if (response->code == process::http::Status::NOT_FOUND) { > // This could happen if the master libprocess process has not yet set up > // HTTP routes. > LOG(WARNING) << "Received '" << response->status << "' (" ><< response->body << ") for " << call.type(); > return; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
[ https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415941#comment-16415941 ] James DeFelice edited comment on MESOS-7697 at 3/27/18 5:13 PM: [https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674] https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740 was (Author: jdef): https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674 > Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions > > > Key: MESOS-7697 > URL: https://issues.apache.org/jira/browse/MESOS-7697 > Project: Mesos > Issue Type: Bug > Components: HTTP API, libprocess >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > Returning a 404 error for a condition that's a known temporary condition is > confusing from a client's perspective. A client wants to know how to recover > from various error conditions. A 404 error condition should be distinct from > a "server is not yet ready, but will be shortly" condition (which should > probably be reported as a 503 "unavailable" error). > https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593 > {code} > if (response->code == process::http::Status::NOT_FOUND) { > // This could happen if the master libprocess process has not yet set up > // HTTP routes. > LOG(WARNING) << "Received '" << response->status << "' (" ><< response->body << ") for " << call.type(); > return; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
[ https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415941#comment-16415941 ] James DeFelice edited comment on MESOS-7697 at 3/27/18 5:14 PM: [https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674] https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L2729-L2740 was (Author: jdef): [https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674] https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740 > Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions > > > Key: MESOS-7697 > URL: https://issues.apache.org/jira/browse/MESOS-7697 > Project: Mesos > Issue Type: Bug > Components: HTTP API, libprocess >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > Returning a 404 error for a condition that's a known temporary condition is > confusing from a client's perspective. A client wants to know how to recover > from various error conditions. A 404 error condition should be distinct from > a "server is not yet ready, but will be shortly" condition (which should > probably be reported as a 503 "unavailable" error). > https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593 > {code} > if (response->code == process::http::Status::NOT_FOUND) { > // This could happen if the master libprocess process has not yet set up > // HTTP routes. > LOG(WARNING) << "Received '" << response->status << "' (" ><< response->body << ") for " << call.type(); > return; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
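From the client side, the two conditions argued about in MESOS-7697 call for different recovery strategies. A sketch of the dispatch a scheduler library could perform (the status-code split follows the ticket's suggestion; the action labels are made up):

```python
def classify_subscribe_failure(status_code):
    """Decide how a scheduler client should react to an HTTP error from
    the master, distinguishing 'routes not ready yet' from a genuinely
    unknown endpoint, per the distinction this ticket asks Mesos to make."""
    if status_code == 503:
        return "retry-after-backoff"  # master up, HTTP routes not yet set up
    if status_code == 404:
        return "fail-bad-endpoint"    # genuinely wrong URL/path
    return "redetect-master"          # anything else: rerun master detection
```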
[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process
[ https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448611#comment-16448611 ] James DeFelice commented on MESOS-4065: --- It looks like the linked ZK ticket was recently resolved. > slave FD for ZK tcp connection leaked to executor process > - > > Key: MESOS-4065 > URL: https://issues.apache.org/jira/browse/MESOS-4065 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.24.1, 0.25.0, 1.2.2 >Reporter: James DeFelice >Priority: Major > Labels: mesosphere, security > > {code} > core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd > root 1432 99.3 0.0 202420 12928 ?Rsl 21:32 13:51 > ./etcd-mesos-executor -log_dir=./ > root 1450 0.4 0.1 38332 28752 ?Sl 21:32 0:03 ./etcd > --data-dir=etcd_data --name=etcd-1449178273 > --listen-peer-urls=http://10.0.0.45:1025 > --initial-advertise-peer-urls=http://10.0.0.45:1025 > --listen-client-urls=http://10.0.0.45:1026 > --advertise-client-urls=http://10.0.0.45:1026 > --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025 > --initial-cluster-state=existing > core 1651 0.0 0.0 6740 928 pts/0S+ 21:46 0:00 grep > --colour=auto -e etcd > core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181 > etcd-meso 1432 root 10u IPv4 21973 0t0TCP > ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 > (ESTABLISHED) > core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave > root 1124 0.2 0.1 900496 25736 ?Ssl 21:11 0:04 > /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave > core 1658 0.0 0.0 6740 832 pts/0S+ 21:46 0:00 grep > --colour=auto -e slave > core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181 > mesos-sla 1124 root 10u IPv4 21973 0t0TCP > ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181 > (ESTABLISHED) > {code} > I only tested against mesos 0.24.1 and 0.25.0. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8967) Comments for FaultDomain should include notes for convention pertaining to additional hierarchy.
James DeFelice created MESOS-8967: - Summary: Comments for FaultDomain should include notes for convention pertaining to additional hierarchy. Key: MESOS-8967 URL: https://issues.apache.org/jira/browse/MESOS-8967 Project: Mesos Issue Type: Task Reporter: James DeFelice The original design doc includes conventions for additional hierarchy. This commentary is missing from the protobuf and so it's easily missed. https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8/edit#heading=h.emfys1xszpir -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9238) rpmbuild checkfiles fails
James DeFelice created MESOS-9238: - Summary: rpmbuild checkfiles fails Key: MESOS-9238 URL: https://issues.apache.org/jira/browse/MESOS-9238 Project: Mesos Issue Type: Bug Reporter: James DeFelice I noticed that Mesos nightly builds haven't been pushed to dockerhub in a while. After some help from Jie and digging a bit more it looks like rpm-build is reporting an error: {code:java} RPM build errors: error: Installed (but unpackaged) file(s) found: /usr/include/rapidjson/allocators.h /usr/include/rapidjson/document.h /usr/include/rapidjson/encodedstream.h /usr/include/rapidjson/encodings.h /usr/include/rapidjson/error/en.h /usr/include/rapidjson/error/error.h /usr/include/rapidjson/filereadstream.h /usr/include/rapidjson/filewritestream.h /usr/include/rapidjson/fwd.h /usr/include/rapidjson/internal/biginteger.h /usr/include/rapidjson/internal/diyfp.h /usr/include/rapidjson/internal/dtoa.h /usr/include/rapidjson/internal/ieee754.h /usr/include/rapidjson/internal/itoa.h /usr/include/rapidjson/internal/meta.h /usr/include/rapidjson/internal/pow10.h /usr/include/rapidjson/internal/regex.h /usr/include/rapidjson/internal/stack.h /usr/include/rapidjson/internal/strfunc.h /usr/include/rapidjson/internal/strtod.h /usr/include/rapidjson/internal/swap.h /usr/include/rapidjson/istreamwrapper.h /usr/include/rapidjson/memorybuffer.h /usr/include/rapidjson/memorystream.h /usr/include/rapidjson/msinttypes/inttypes.h /usr/include/rapidjson/msinttypes/stdint.h /usr/include/rapidjson/ostreamwrapper.h /usr/include/rapidjson/pointer.h /usr/include/rapidjson/prettywriter.h /usr/include/rapidjson/rapidjson.h /usr/include/rapidjson/reader.h /usr/include/rapidjson/schema.h /usr/include/rapidjson/stream.h /usr/include/rapidjson/stringbuffer.h /usr/include/rapidjson/writer.h Macro %MESOS_VERSION has empty body Macro %MESOS_RELEASE has empty body {code} Furthermore, the cleanup func that's invoked by the trap is failing with a bunch of permission errors: {code:java} cleanup rm: cannot remove 
'/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/.cache': Permission denied rm: cannot remove '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/SRPMS': Permission denied rm: cannot remove '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/lib/mesos': Permission denied rm: cannot remove '/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/log/mesos': Permission denied ... {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9293) OperationStatus messages sent to framework should include both agent ID and resource provider ID
James DeFelice created MESOS-9293: - Summary: OperationStatus messages sent to framework should include both agent ID and resource provider ID Key: MESOS-9293 URL: https://issues.apache.org/jira/browse/MESOS-9293 Project: Mesos Issue Type: Bug Affects Versions: 1.7.0 Reporter: James DeFelice Normally, frameworks are expected to checkpoint agent ID and resource provider ID before accepting an offer with an OfferOperation. From this expectation comes the requirement in the v1 scheduler API that a framework must provide the agent ID and resource provider ID when acknowledging an offer operation status update. However, this expectation breaks down: 1. the framework might lose its checkpointed data; it no longer remembers the agent ID or the resource provider ID. 2. even if the framework checkpoints data, it could be sent a stale update: maybe the original ACK it sent to Mesos was lost, and it needs to re-ACK. If a framework deleted its checkpointed data after sending the ACK (that's dropped) then upon replay of the status update it no longer has the agent ID or resource provider ID for the operation. An easy remedy would be to add the agent ID and resource provider ID to the OperationStatus message received by the scheduler so that a framework can build a proper ACK for the update, even if it doesn't have access to its previously checkpointed information. I'm filing this as a BUG because there's no way to reliably use the offer operation status API until this has been fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
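With the agent ID and resource provider ID carried on the update itself, as MESOS-9293 proposes, the ACK can be built without any checkpointed state. A sketch (the message shapes are illustrative dicts, not the exact protobufs, and the two ID fields are the proposed additions):

```python
def build_ack(update):
    """Construct an operation-status ACK purely from the received update,
    assuming the update carries the agent and resource provider IDs
    proposed by this ticket (i.e. no checkpoint lookup needed)."""
    status = update["status"]
    return {
        "agent_id": update["agent_id"],                          # proposed field
        "resource_provider_id": update["resource_provider_id"],  # proposed field
        "uuid": status["uuid"],
        "operation_id": status["operation_id"],
    }
```

This also makes re-ACKing replayed updates safe after a framework has garbage-collected its checkpoint.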
[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors
[ https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640285#comment-16640285 ] James DeFelice commented on MESOS-9223: --- Regardless of whether retries are implemented, it would be nice to have an API that exposed the reason for the error, e.g. the last log line or the Mesos error related to the container failure. > Storage local provider does not sufficiently handle container launch failures > or errors > --- > > Key: MESOS-9223 > URL: https://issues.apache.org/jira/browse/MESOS-9223 > Project: Mesos > Issue Type: Improvement > Components: agent, storage >Reporter: Benjamin Bannier >Assignee: Chun-Hung Hsiao >Priority: Blocker > > The storage local resource provider as currently implemented does not handle > launch failures or task errors of its standalone containers well enough. If, > e.g., an RP container fails to come up during node start, a warning would be > logged, but an operator still needs to detect degraded functionality, > manually check the state of containers with {{GET_CONTAINERS}}, and decide > whether the agent needs restarting; I suspect they do not always have > enough context for this decision. It would be better if the provider would > either enforce a restart by failing over the whole agent, or retry the > operation (optionally: up to some maximum number of retries). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9308) URI disk profile adaptor could deadlock.
[ https://issues.apache.org/jira/browse/MESOS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646563#comment-16646563 ] James DeFelice commented on MESOS-9308: --- I'm wondering why this wasn't caught by a unit test. Maybe we need more unit tests around helpers like this. Tech debt? > URI disk profile adaptor could deadlock. > > > Key: MESOS-9308 > URL: https://issues.apache.org/jira/browse/MESOS-9308 > Project: Mesos > Issue Type: Bug > Components: storage >Affects Versions: 1.5.1, 1.6.1, 1.7.0 >Reporter: Jie Yu >Assignee: Chun-Hung Hsiao >Priority: Critical > Labels: mesosphere, storage > Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0 > > > The loop here can be infinite: > https://github.com/apache/mesos/blob/1.7.0/src/resource_provider/storage/uri_disk_profile_adaptor.cpp#L61-L80 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9313) Document speculative offer operation semantics for framework writers.
James DeFelice created MESOS-9313: - Summary: Document speculative offer operation semantics for framework writers. Key: MESOS-9313 URL: https://issues.apache.org/jira/browse/MESOS-9313 Project: Mesos Issue Type: Documentation Reporter: James DeFelice It recently came to my attention that a subset of offer operations (e.g. RESERVE, UNRESERVE, et al.) are implemented speculatively within mesos master. Meaning that the master will apply the resource conversion internally **before** the conversion is checkpointed on the agent. The master may then re-offer the converted resource to a framework -- even though the agent may still not have checkpointed the resource conversion. If the checkpointing process on the agent fails, then subsequent operations issued for the falsely-offered resource will fail. Because the master essentially "lied" to the framework about the true state of the supposedly-converted resource. It's also been explained to me that this case is expected to be rare. However, it *can* impact the design/implementation of framework state machines and so it's critical that this information be documented clearly - outside of the C++ code base. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9318) Consider providing better operation status updates while an RP is recovering
[ https://issues.apache.org/jira/browse/MESOS-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650756#comment-16650756 ] James DeFelice commented on MESOS-9318: --- Yes to this. It's pretty annoying to deal with this kind of UNKNOWN otherwise. > Consider providing better operation status updates while an RP is recovering > > > Key: MESOS-9318 > URL: https://issues.apache.org/jira/browse/MESOS-9318 > Project: Mesos > Issue Type: Task >Affects Versions: 1.6.0, 1.7.0 >Reporter: Gastón Kleiman >Priority: Major > Labels: mesosphere, operation-feedback > > Consider the following scenario: > 1. A framework accepts an offer with an operation affecting SLRP resources. > 2. The master forwards it to the corresponding agent. > 3. The agent forwards it to the corresponding RP. > 4. The agent and the master fail over. > 5. The master recovers. > 6. The agent recovers while the RP is still recovering, so it doesn't include > the pending operation on the {{RegisterMessage}}. > 7. A framework performs an explicit operation status reconciliation. > In this case the master will currently respond with {{OPERATION_UNKNOWN}}, > but it should be possible to respond with a more fine-grained and useful > state, such as {{OPERATION_RECOVERING}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
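The fine-grained answer the ticket proposes can be sketched as a three-way reconciliation decision. `OPERATION_RECOVERING` comes from the ticket itself, but the enum, the `recoveringAgents` set, and the `reconcile` function below are hypothetical, not the master's actual code:

```cpp
#include <cassert>
#include <set>
#include <string>

enum class OperationState { KNOWN, RECOVERING, UNKNOWN };

// `recoveringAgents` stands in for whatever bookkeeping would track agents
// that reregistered while one of their resource providers was still
// recovering (step 6 in the ticket's scenario).
OperationState reconcile(
    const std::string& agentId,
    const std::set<std::string>& knownOperations,
    const std::set<std::string>& recoveringAgents,
    const std::string& operationUuid)
{
  if (knownOperations.count(operationUuid) > 0) {
    return OperationState::KNOWN;
  }

  // If the agent's RP is still recovering, the operation may simply not
  // have been reported yet: answer RECOVERING instead of UNKNOWN so
  // frameworks can tell "not yet reported" apart from "never existed".
  if (recoveringAgents.count(agentId) > 0) {
    return OperationState::RECOVERING;
  }

  return OperationState::UNKNOWN;
}
```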
[jira] [Commented] (MESOS-9352) Data in persistent volume deleted accidentally when using Docker container and Persistent volume
[ https://issues.apache.org/jira/browse/MESOS-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662650#comment-16662650 ] James DeFelice commented on MESOS-9352: --- Seemingly related: MESOS-9049, MESOS-8830, MESOS-2408 > Data in persistent volume deleted accidentally when using Docker container > and Persistent volume > > > Key: MESOS-9352 > URL: https://issues.apache.org/jira/browse/MESOS-9352 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, docker >Affects Versions: 1.5.1, 1.5.2 > Environment: DCOS 1.11.6 > Mesos 1.5.2 >Reporter: David Ko >Priority: Critical > Labels: mesosphere, persistent-volumes > Attachments: image-2018-10-24-22-20-51-059.png, > image-2018-10-24-22-21-13-399.png > > > Starting a service from a Docker image with a persistent volume can cause the data in the persistent volume to be deleted accidentally when the task is killed and restarted; old mount points are also left mounted, even after the service has been deleted. > *The expected behavior is that data in the persistent volume is kept until the task is deleted completely, and that dangling mount points are unmounted correctly.* > > *Step 1:* Use the JSON config below to create a MySQL server using a Docker image and a persistent volume
> {code:javascript}
> {
>   "env": {
>     "MYSQL_USER": "wordpress",
>     "MYSQL_PASSWORD": "secret",
>     "MYSQL_ROOT_PASSWORD": "supersecret",
>     "MYSQL_DATABASE": "wordpress"
>   },
>   "id": "/mysqlgc",
>   "backoffFactor": 1.15,
>   "backoffSeconds": 1,
>   "constraints": [
>     ["hostname", "IS", "172.27.12.216"]
>   ],
>   "container": {
>     "portMappings": [
>       {
>         "containerPort": 3306,
>         "hostPort": 0,
>         "protocol": "tcp",
>         "servicePort": 1
>       }
>     ],
>     "type": "DOCKER",
>     "volumes": [
>       {
>         "persistent": {
>           "type": "root",
>           "size": 1000,
>           "constraints": []
>         },
>         "mode": "RW",
>         "containerPath": "mysqldata"
>       },
>       {
>         "containerPath": "/var/lib/mysql",
>         "hostPath": "mysqldata",
>         "mode": "RW"
>       }
>     ],
>     "docker": {
>       "image": "mysql",
>       "forcePullImage": false,
>       "privileged": false,
>       "parameters": []
>     }
>   },
>   "cpus": 1,
>   "disk": 0,
>   "instances": 1,
>   "maxLaunchDelaySeconds": 3600,
>   "mem": 512,
>   "gpus": 0,
>   "networks": [
>     { "mode": "container/bridge" }
>   ],
>   "residency": {
>     "relaunchEscalationTimeoutSeconds": 3600,
>     "taskLostBehavior": "WAIT_FOREVER"
>   },
>   "requirePorts": false,
>   "upgradeStrategy": {
>     "maximumOverCapacity": 0,
>     "minimumHealthCapacity": 0
>   },
>   "killSelection": "YOUNGEST_FIRST",
>   "unreachableStrategy": "disabled",
>   "healthChecks": [],
>   "fetch": []
> }
> {code}
> *Step 2:* Kill the mysqld process to force rescheduling of a new MySQL task. Afterwards there are 2 mount points to the same persistent volume, meaning the old mount point was not unmounted immediately. > !image-2018-10-24-22-20-51-059.png! > *Step 3:* After GC, the data in the persistent volume was deleted accidentally, but mysqld (the Mesos task) is still running. > !image-2018-10-24-22-21-13-399.png! > *Step 4:* Delete the MySQL service from Marathon: none of the mount points can be unmounted, even though the service has already been deleted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors
[ https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707318#comment-16707318 ] James DeFelice commented on MESOS-9223: --- MESOS-8380 addresses UI changes. The UI should not be the only place to easily observe/troubleshoot errors. Ideally there'd be an API that exposes such. > Storage local provider does not sufficiently handle container launch failures > or errors > --- > > Key: MESOS-9223 > URL: https://issues.apache.org/jira/browse/MESOS-9223 > Project: Mesos > Issue Type: Improvement > Components: agent, storage >Reporter: Benjamin Bannier >Priority: Critical > > The storage local resource provider as currently implemented does not handle > launch failures or task errors of its standalone containers well enough. If, > e.g., an RP container fails to come up during node start, a warning would be > logged, but an operator still needs to detect degraded functionality, > manually check the state of containers with {{GET_CONTAINERS}}, and decide > whether the agent needs restarting; I suspect they do not always have > enough context for this decision. It would be better if the provider would > either enforce a restart by failing over the whole agent, or retry the > operation (optionally: up to some maximum number of retries). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9517) SLRP should treat gRPC timeouts as non-terminal errors, instead of reporting OPERATION_FAILED.
James DeFelice created MESOS-9517: - Summary: SLRP should treat gRPC timeouts as non-terminal errors, instead of reporting OPERATION_FAILED. Key: MESOS-9517 URL: https://issues.apache.org/jira/browse/MESOS-9517 Project: Mesos Issue Type: Bug Components: resource provider, storage Reporter: James DeFelice Assignee: Chun-Hung Hsiao
1. A framework executes a CREATE_DISK operation.
2. The SLRP issues a CreateVolume RPC to the plugin.
3. The RPC call times out.
4. The agent/SLRP translates the non-terminal gRPC timeout error (DeadlineExceeded) for the "CreateVolume" call into OPERATION_FAILED, which is terminal.
5. The framework receives a *terminal* OPERATION_FAILED status, so it executes another CREATE_DISK operation.
6. The second CREATE_DISK operation does not time out.
7. The first CREATE_DISK operation was actually completed by the plugin, unbeknownst to the SLRP.
8. There's now an orphan volume in the storage system that no one is tracking.
Proposed solution: the SLRP makes more intelligent decisions about non-terminal gRPC errors. For example, timeouts are likely expected for potentially long-running storage operations and should not be considered terminal. In such cases, the SLRP should NOT report OPERATION_FAILED and instead should re-issue the **same** (idempotent) CreateVolume call to the plugin to ascertain the status of the requested volume creation. 
Agent logs for the 3 orphan volumes above:
{code}
[jdefelice@ec101 DCOS-46889]$ grep -e 3bd1a1a9-43d3-485c-9275-59cebd64b07c agent.log
Jan 09 11:10:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:10:27.896306 13189 provider.cpp:1548] Received CREATE_DISK operation 'a1BdfrEhy4ZLSNPZbDrzp1h-0' (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c)
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: E0109 11:11:27.904057 13190 provider.cpp:1605] Failed to apply operation (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c): Deadline Exceeded
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:11:27.904058 13192 status_update_manager_process.hpp:152] Received operation status update OPERATION_FAILED (Status UUID: 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:11:27.904331 13192 status_update_manager_process.hpp:929] Checkpointing UPDATE for operation status update OPERATION_FAILED (Status UUID: 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:11:27.947286 13189 slave.cpp:7696] Handling resource provider message 'UPDATE_OPERATION_STATUS: (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: OPERATION_FAILED, status update state: OPERATION_FAILED)'
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:11:27.947376 13189 slave.cpp:8034] Updating the state of operation 'a1BdfrEhy4ZLSNPZbDrzp1h-0' (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: OPERATION_FAILED, status update state: OPERATION_FAILED)
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:11:27.947407 13189 slave.cpp:7890] Forwarding status update of operation 'a1BdfrEhy4ZLSNPZbDrzp1h-0' (operation_uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:11:27.952689 13193 status_update_manager_process.hpp:252] Received operation status update acknowledgement (UUID: 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for stream 3bd1a1a9-43d3-485c-9275-59cebd64b07c
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: I0109 11:11:27.952725 13193 status_update_manager_process.hpp:929] Checkpointing ACK for operation status update OPERATION_FAILED (Status UUID: 8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
[jdefelice@ec101 DCOS-46889]$ grep -e 4acf1495-1a36-4939-a71b-75ca5aa73657 agent.log
Jan 09 11:10:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[131
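The proposed fix amounts to classifying plugin errors before mapping them to an operation status, instead of translating every failure into the terminal OPERATION_FAILED. The enums and function below are an illustrative sketch only (the real SLRP works with gRPC status codes and CSI request protos); CSI's idempotent CreateVolume semantics are what make retrying the same call safe:

```cpp
#include <cassert>

// Simplified stand-ins for gRPC status codes and SLRP decisions;
// these types are hypothetical, not Mesos or gRPC definitions.
enum class GrpcStatus {
  OK,
  DEADLINE_EXCEEDED,
  UNAVAILABLE,
  INVALID_ARGUMENT,
  INTERNAL
};

enum class Action { SUCCEED, RETRY_SAME_CALL, FAIL_OPERATION };

Action classifyCreateVolumeResult(GrpcStatus status)
{
  switch (status) {
    case GrpcStatus::OK:
      return Action::SUCCEED;

    // DeadlineExceeded (and arguably Unavailable) are non-terminal: the
    // plugin may still have completed the call, so re-issue the same
    // idempotent CreateVolume request instead of reporting the terminal
    // OPERATION_FAILED, which would prompt the framework to create a
    // second, orphaned volume.
    case GrpcStatus::DEADLINE_EXCEEDED:
    case GrpcStatus::UNAVAILABLE:
      return Action::RETRY_SAME_CALL;

    // Genuinely terminal plugin errors still map to OPERATION_FAILED.
    default:
      return Action::FAIL_OPERATION;
  }
}
```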
[jira] [Commented] (MESOS-9523) Add per-framework allocatable resources matcher/filter.
[ https://issues.apache.org/jira/browse/MESOS-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744105#comment-16744105 ] James DeFelice commented on MESOS-9523: --- Is there already a design that justifies support for complex expressions beyond "min_allocatable_resources"? If so, would you mind dropping a link in the description of this ticket? > Add per-framework allocatable resources matcher/filter. > --- > > Key: MESOS-9523 > URL: https://issues.apache.org/jira/browse/MESOS-9523 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Bannier >Priority: Major > Labels: mesosphere, storage > > Currently, Mesos has a single global flag `min_allocatable_resources` that > provides some control over the shape of the offer. But, being a global flag, > finding a one-size-fits-all shape is hard and less than ideal. It would be > great if frameworks could specify different shapes based on their needs. > In addition to extending this flag to be per-framework, it is also a good > opportunity to see if it can express more than `min_allocatable`, e.g. by providing > more predicates such as max, (not) contains, etc. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9586) mesos/mesos-centos nightly images should include development headers
James DeFelice created MESOS-9586: - Summary: mesos/mesos-centos nightly images should include development headers Key: MESOS-9586 URL: https://issues.apache.org/jira/browse/MESOS-9586 Project: Mesos Issue Type: Improvement Reporter: James DeFelice The existing Mesos nightly images greatly simplify the process of tracking the Mesos master branch w/ integration tests. Our integration tests now have a new requirement: we'd like to build Mesos modules against the latest master nightlies. This is difficult with the current mesos/mesos-centos dockerhub images because they don't include the development headers. Ideally, these headers would be available in this (or a sibling) image. [https://github.com/apache/mesos/blob/master/support/packaging/centos/build-docker-centos.sh#L25-L29] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9590) Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.
James DeFelice created MESOS-9590: - Summary: Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches. Key: MESOS-9590 URL: https://issues.apache.org/jira/browse/MESOS-9590 Project: Mesos Issue Type: Bug Reporter: James DeFelice Assignee: Jie Yu I pulled image mesos/mesos-centos:master-2019-02-15 some time on the 15th and worked with it locally, on my laptop, for about a week. Part of that work included downloading the related mesos-xxx-devel.rpm from the same CI build that produced the image so that I could build 3rd party mesos modules from the master base image. The rpm was labeled as pre-1.8.0. This worked great until I tried to repeat the work on another machine. The other machine pulled the "same" dockerhub image (mesos/mesos-centos:master-2019-02-15), which was somehow built with a mesos-xxx.rpm labeled as pre-1.7.2. I couldn't build my docker image using this strangely new base because the mesos-xxx-devel.rpm I had hardcoded into the dockerfile no longer aligned with the version of the mesos RPM that was shipping in the base image. The base image had changed, such that the mesos RPM version went from 1.8.0 to 1.7.2. This should never happen. [~jieyu] investigated and found that the problem appears to happen at random. Current thinking is that one of the mesos CI boxes uses a version of git that's too old, and that the CI scripts are incorrectly ignoring a git command failure: the git command fails because the git version is too old, and the script subsequently ignores any failures from the command pipeline in which this command is executed. As a result, the "version" of the branch being built cannot be detected and therefore defaults to master, overwriting *actual* master image builds. 
[~jieyu] also wrote some patches, which I'll link here: * https://reviews.apache.org/r/70024/ * https://reviews.apache.org/r/70025/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)