[jira] [Commented] (MESOS-3352) Problem Statement Summary for Systemd Cgroup Launcher

2015-09-14 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743885#comment-14743885
 ] 

James DeFelice commented on MESOS-3352:
---

For the proposed independent slice that will house tasks, is the name/path
of that slice predictable/discoverable from within a custom executor?

On Mon, Sep 14, 2015 at 11:01 AM, Joris Van Remoortere (JIRA) <




-- 
James DeFelice
585.241.9488 (voice)
650.649.6071 (fax)


> Problem Statement Summary for Systemd Cgroup Launcher
> -
>
> Key: MESOS-3352
> URL: https://issues.apache.org/jira/browse/MESOS-3352
> Project: Mesos
>  Issue Type: Task
>Reporter: Joris Van Remoortere
>Assignee: Joris Van Remoortere
>  Labels: design, mesosphere, systemd
>
> There have been many reports of cgroups related issues when running Mesos on 
> Systemd.
> Many of these issues are rooted in the manual manipulation of the cgroups 
> filesystem by Mesos.
> This task is to describe the problem in a 1-page summary, and elaborate on 
> the suggested 2-part solution:
> 1. Using the {{delegate=true}} flag for the slave
> 2. Implementing a Systemd launcher to run executors with tighter Systemd 
> integration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3408) Labels field of FrameworkInfo should be added into v1 mesos.proto

2015-09-15 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-3408:
--
Labels: mesosphere  (was: )

> Labels field of FrameworkInfo should be added into v1 mesos.proto
> -
>
> Key: MESOS-3408
> URL: https://issues.apache.org/jira/browse/MESOS-3408
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: mesosphere
> Fix For: 0.25.0
>
>
> In [MESOS-2841|https://issues.apache.org/jira/browse/MESOS-2841], a new field 
> "Labels" has been added to FrameworkInfo in mesos.proto, but it is missing 
> from v1 mesos.proto.





[jira] [Commented] (MESOS-3507) As an operator, I want a way to inspect queued tasks in running schedulers

2015-09-25 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909004#comment-14909004
 ] 

James DeFelice commented on MESOS-3507:
---

I like the idea of frameworks publishing multiple urls. Perhaps labeled. Maybe 
an endpoint message that consists of a name and a url. Possibly add acl or 
visibility later. Frameworks could publish multiple endpoints. This would be 
great for kubernetes.

> As an operator, I want a way to inspect queued tasks in running schedulers
> --
>
> Key: MESOS-3507
> URL: https://issues.apache.org/jira/browse/MESOS-3507
> Project: Mesos
>  Issue Type: Story
>Reporter: Niklas Quarfot Nielsen
>
> Currently, there is no uniform way of getting a notion of 'awaiting' tasks, 
> i.e. expressing that a framework has more work to do. This information is 
> useful for auto-scaling and anomaly detection systems. Schedulers tend to 
> expose this over their own http endpoints, but the formats across schedulers 
> are most likely not compatible.





[jira] [Commented] (MESOS-3352) Problem Statement Summary for Systemd Cgroup Launcher

2015-10-01 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939795#comment-14939795
 ] 

James DeFelice commented on MESOS-3352:
---

What release is this targeting?

> Problem Statement Summary for Systemd Cgroup Launcher
> -
>
> Key: MESOS-3352
> URL: https://issues.apache.org/jira/browse/MESOS-3352
> Project: Mesos
>  Issue Type: Task
>Reporter: Joris Van Remoortere
>Assignee: Joris Van Remoortere
>  Labels: design, mesosphere, systemd
>
> There have been many reports of cgroups related issues when running Mesos on 
> Systemd.
> Many of these issues are rooted in the manual manipulation of the cgroups 
> filesystem by Mesos.
> This task is to describe the problem in a 1-page summary, and elaborate on 
> the suggested 2-part solution:
> 1. Using the {{delegate=true}} flag for the slave
> 2. Implementing a Systemd launcher to run executors with tighter Systemd 
> integration.





[jira] [Commented] (MESOS-3352) Problem Statement Summary for Systemd Cgroup Launcher

2015-10-03 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942537#comment-14942537
 ] 

James DeFelice commented on MESOS-3352:
---

Awesome. Thanks for the heads up
On Oct 3, 2015 6:48 PM, "Joris Van Remoortere (JIRA)" 



> Problem Statement Summary for Systemd Cgroup Launcher
> -
>
> Key: MESOS-3352
> URL: https://issues.apache.org/jira/browse/MESOS-3352
> Project: Mesos
>  Issue Type: Task
>Reporter: Joris Van Remoortere
>Assignee: Joris Van Remoortere
>  Labels: design, mesosphere, systemd
>
> There have been many reports of cgroups related issues when running Mesos on 
> Systemd.
> Many of these issues are rooted in the manual manipulation of the cgroups 
> filesystem by Mesos.
> This task is to describe the problem in a 1-page summary, and elaborate on 
> the suggested 2-part solution:
> 1. Using the {{delegate=true}} flag for the slave
> 2. Implementing a Systemd launcher to run executors with tighter Systemd 
> integration.





[jira] [Created] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-03 Thread James DeFelice (JIRA)
James DeFelice created MESOS-4065:
-

 Summary: slave FD for ZK tcp connection leaked to executor process
 Key: MESOS-4065
 URL: https://issues.apache.org/jira/browse/MESOS-4065
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.25.0, 0.24.1
Reporter: James DeFelice


{code}
core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
./etcd-mesos-executor -log_dir=./
root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
--data-dir=etcd_data --name=etcd-1449178273 
--listen-peer-urls=http://10.0.0.45:1025 
--initial-advertise-peer-urls=http://10.0.0.45:1025 
--listen-client-urls=http://10.0.0.45:1026 
--advertise-client-urls=http://10.0.0.45:1026 
--initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
 --initial-cluster-state=existing
core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
--colour=auto -e etcd

core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
 (ESTABLISHED)

core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
/opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
--colour=auto -e slave

core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
 (ESTABLISHED)
{code}

I only tested against mesos 0.24.1 and 0.25.0.





[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-07 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046360#comment-15046360
 ] 

James DeFelice commented on MESOS-4065:
---

Is this going to require a change to the zookeeper C client bindings? e.g. 
http://svn.apache.org/viewvc/zookeeper/branches/branch-3.5/src/c/src/zookeeper.c?view=markup
 somewhere around line 2203, adding O_CLOEXEC to the socket() call?

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.





[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-07 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046361#comment-15046361
 ] 

James DeFelice commented on MESOS-4065:
---

Possible impact on SELinux systems? 
http://danwalsh.livejournal.com/53603.html?page=1

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.





[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2015-12-08 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047064#comment-15047064
 ] 

James DeFelice commented on MESOS-4065:
---

https://issues.apache.org/jira/browse/ZOOKEEPER-2338

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0
>Reporter: James DeFelice
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.





[jira] [Comment Edited] (MESOS-4120) Make DiscoveryInfo dynamically updatable

2015-12-10 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051817#comment-15051817
 ] 

James DeFelice edited comment on MESOS-4120 at 12/10/15 11:10 PM:
--

From a K8s integration perspective it's preferable that some sidecar framework 
component could update the task's DiscoveryInfo vs. putting all of that 
responsibility on the executor.


was (Author: jdef):
From a K8s integration perspective it's preferable that some sidecar framework 
component could update the a task's DiscoveryInfo vs. putting all of that 
responsibility on the executor.

> Make DiscoveryInfo dynamically updatable
> 
>
> Key: MESOS-4120
> URL: https://issues.apache.org/jira/browse/MESOS-4120
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Sargun Dhillon
>Priority: Critical
>  Labels: mesosphere
>
> K8s tasks can dynamically update what they expose for discovery by the 
> cluster. Unfortunately, all DiscoveryInfo in the cluster is immutable at the 
> time of task start. 
> We would like to enable DiscoveryInfo to be dynamically updatable, so that 
> executors can change what they're advertising based on their internal state, 
> versus requiring DiscoveryInfo to be known prior to starting the tasks. 





[jira] [Commented] (MESOS-4120) Make DiscoveryInfo dynamically updatable

2015-12-10 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051817#comment-15051817
 ] 

James DeFelice commented on MESOS-4120:
---

From a K8s integration perspective it's preferable that some sidecar framework 
component could update the task's DiscoveryInfo vs. putting all of that 
responsibility on the executor.

> Make DiscoveryInfo dynamically updatable
> 
>
> Key: MESOS-4120
> URL: https://issues.apache.org/jira/browse/MESOS-4120
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Sargun Dhillon
>Priority: Critical
>  Labels: mesosphere
>
> K8s tasks can dynamically update what they expose for discovery by the 
> cluster. Unfortunately, all DiscoveryInfo in the cluster is immutable at the 
> time of task start. 
> We would like to enable DiscoveryInfo to be dynamically updatable, so that 
> executors can change what they're advertising based on their internal state, 
> versus requiring DiscoveryInfo to be known prior to starting the tasks. 





[jira] [Commented] (MESOS-4120) Make DiscoveryInfo dynamically updatable

2015-12-12 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054716#comment-15054716
 ] 

James DeFelice commented on MESOS-4120:
---

If there's no support for updates via sidecar then a significant consequence 
(of forcing service/endpoint metadata through a task's DiscoveryInfo) is that 
every kubelet in the cluster will have to watch all changes to all endpoints in 
the cluster. This will not scale.

OTOH if a sidecar component can make updates to DiscoveryInfo then we can run a 
single instance of the sidecar (or sharded instances) and that should scale 
much, much better.

In k8s there's no way to attach annotations to individual endpoint addresses; 
instead we'd need to somehow annotate the entire Endpoints struct with 
key/values such that taskId is reasonably discoverable by this sidecar. This is 
ugly and forces us to keep maintaining our forked version of the custom k8s 
endpoint-controller (which we ultimately want to drop support for).

Overall I'd prefer to advertise service discovery metadata through a different 
channel than a task's DiscoveryInfo.

> Make DiscoveryInfo dynamically updatable
> 
>
> Key: MESOS-4120
> URL: https://issues.apache.org/jira/browse/MESOS-4120
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.25.0, 0.26.0
>Reporter: Sargun Dhillon
>Priority: Critical
>  Labels: mesosphere
>
> K8s tasks can dynamically update what they expose for discovery by the 
> cluster. Unfortunately, all DiscoveryInfo in the cluster is immutable at the 
> time of task start. 
> We would like to enable DiscoveryInfo to be dynamically updatable, so that 
> executors can change what they're advertising based on their internal state, 
> versus requiring DiscoveryInfo to be known prior to starting the tasks. 





[jira] [Commented] (MESOS-4086) Containerizer logging modularization

2015-12-12 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054825#comment-15054825
 ] 

James DeFelice commented on MESOS-4086:
---

I left this comment in the design doc too, related to the logging module API 
call that's invoked on each executor creation:
{quote}
I'd like to see the executor labels passed in here so that frameworks can 
advertise additional metadata for logging modules to take advantage of. For 
example, one or more LOGDIRSx=... vars could instruct the logging module to 
monitor additional directories in the sandbox for log files.
{quote}

> Containerizer logging modularization
> 
>
> Key: MESOS-4086
> URL: https://issues.apache.org/jira/browse/MESOS-4086
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
>
> Executors and tasks are configured (via the various containerizers) to write 
> their output (stdout/stderr) to files ("stdout" and "stderr") on an agent's 
> disk.
> Unlike Master/Agent logs, executor/task logs are not attached to any formal 
> logging system, like {{glog}}.  As such, there is significant scope for 
> improvement.
> By introducing a module for logging, we can provide a common/programmatic way 
> to access and manage executor/task logs.  Modules could implement additional 
> sinks for logs, such as:
> * to the sandbox (the status quo),
> * to syslog,
> * to journald
> This would also provide the hooks to deal with logging related problems, such 
> as:
> * the (current) lack of log rotation,
> * searching through executor/task logs (i.e. via aggregation)





[jira] [Created] (MESOS-4548) Errors communicated to the scheduler should be associated with stable error codes.

2016-01-28 Thread James DeFelice (JIRA)
James DeFelice created MESOS-4548:
-

 Summary: Errors communicated to the scheduler should be associated 
with stable error codes.
 Key: MESOS-4548
 URL: https://issues.apache.org/jira/browse/MESOS-4548
 Project: Mesos
  Issue Type: Improvement
Reporter: James DeFelice


For example, in mesos 0.24 there was a change to the error message generated by 
the master when a previously removed framework attempts to re-register: 
https://github.com/apache/mesos/commit/8661672d80cbe3ebd05e68a6fc4167b54ea139ef

Some frameworks, rightly or not, attempt to compare the generated error string 
to "Completed framework attempted to re-register" which changed in mesos 0.24 
to "Framework has been removed". These frameworks are now broken with respect 
to this aspect of their error handling, at least until they're changed to check 
for the new error string.

Arguably frameworks shouldn't be comparing error strings since they're not 
guaranteed to remain stable across releases. However, mesos currently offers no 
alternative since there's no error **code** in the API.

Furthermore, with the rise of the HTTP API there's room for two classes of 
errors: synchronous validation errors vs. asynchronous errors. It would be 
ideal to have meaningful 4xx error code responses for synchronous errors as 
well as error codes for asynchronous errors delivered via ERROR events. These 
error codes would become part of a stable API that mesos would treat just like 
the rest of its APIs, allowing for deprecation cycles before breaking changes - 
or at the very least a release note indicating an immediate breaking change.

/cc [~vinodkone]





[jira] [Updated] (MESOS-4548) Errors communicated to the scheduler should be associated with stable error codes.

2016-01-28 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-4548:
--
Affects Version/s: 0.24.0

> Errors communicated to the scheduler should be associated with stable error 
> codes.
> --
>
> Key: MESOS-4548
> URL: https://issues.apache.org/jira/browse/MESOS-4548
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.24.0
>Reporter: James DeFelice
>  Labels: mesosphere
>
> For example, in mesos 0.24 there was a change to the error message generated 
> by the master when a previously removed framework attempts to re-register: 
> https://github.com/apache/mesos/commit/8661672d80cbe3ebd05e68a6fc4167b54ea139ef
> Some frameworks, rightly or not, attempt to compare the generated error 
> string to "Completed framework attempted to re-register" which changed in 
> mesos 0.24 to "Framework has been removed". These frameworks are now broken 
> with respect to this aspect of their error handling, at least until they're 
> changed to check for the new error string.
> Arguably frameworks shouldn't be comparing error strings since they're not 
> guaranteed to remain stable across releases. However, mesos currently offers 
> no alternative since there's no error **code** in the API.
> Furthermore, with the rise of the HTTP API there's room for two classes of 
> errors: synchronous validation errors vs. asynchronous errors. It would be 
> ideal to have meaningful 4xx error code responses for synchronous errors as 
> well as error codes for asynchronous errors delivered via ERROR events. These 
> error codes would become part of a stable API that mesos would treat just 
> like the rest of its APIs, allowing for deprecation cycles before breaking 
> changes - or at the very least a release note indicating an immediate 
> breaking change.
> /cc [~vinodkone]





[jira] [Updated] (MESOS-4548) Errors communicated to the scheduler should be associated with stable error codes.

2016-01-28 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-4548:
--
Description: 
For example, in mesos 0.24 there was a change to the error message generated by 
the master when a previously removed framework attempts to re-register: 
https://github.com/apache/mesos/commit/8661672d80cbe3ebd05e68a6fc4167b54ea139ef

Some frameworks, rightly or not, attempt to compare the generated error string 
to "Completed framework attempted to re-register" which changed in mesos 0.24 
to "Framework has been removed". These frameworks are now broken with respect 
to this aspect of their error handling, at least until they're changed to check 
for the new error string.

Arguably frameworks shouldn't be comparing error strings since they're not 
guaranteed to remain stable across releases. However, mesos currently offers no 
alternative since there's no error **code** in the API.

Furthermore, with the rise of the HTTP API there's room for two classes of 
errors: synchronous validation errors vs. asynchronous errors. It would be 
ideal to have meaningful 4xx error code responses for synchronous errors as 
well as error codes for asynchronous errors delivered via ERROR events. These 
error codes would become part of a stable API that mesos would treat just like 
the rest of its APIs, allowing for deprecation cycles before breaking changes - 
or at the very least a release note indicating an immediate breaking change.

/cc [~vinodkone], [~bmahler]

  was:
For example, in mesos 0.24 there was a change to the error message generated by 
the master when a previously removed framework attempts to re-register: 
https://github.com/apache/mesos/commit/8661672d80cbe3ebd05e68a6fc4167b54ea139ef

Some frameworks, rightly or not, attempt to compare the generated error string 
to "Completed framework attempted to re-register" which changed in mesos 0.24 
to "Framework has been removed". These frameworks are now broken with respect 
to this aspect of their error handling, at least until they're changed to check 
for the new error string.

Arguably frameworks shouldn't be comparing error strings since they're not 
guaranteed to remain stable across releases. However, mesos currently offers no 
alternative since there's no error **code** in the API.

Furthermore, with the rise of the HTTP API there's room for two classes of 
errors: synchronous validation errors vs. asynchronous errors. It would be 
ideal to have meaningful 4xx error code responses for synchronous errors as 
well as error codes for asynchronous errors delivered via ERROR events. These 
error codes would become part of a stable API that mesos would treat just like 
the rest of its APIs, allowing for deprecation cycles before breaking changes - 
or at the very least a release note indicating an immediate breaking change.

/cc [~vinodkone]


> Errors communicated to the scheduler should be associated with stable error 
> codes.
> --
>
> Key: MESOS-4548
> URL: https://issues.apache.org/jira/browse/MESOS-4548
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.24.0
>Reporter: James DeFelice
>  Labels: mesosphere
>
> For example, in mesos 0.24 there was a change to the error message generated 
> by the master when a previously removed framework attempts to re-register: 
> https://github.com/apache/mesos/commit/8661672d80cbe3ebd05e68a6fc4167b54ea139ef
> Some frameworks, rightly or not, attempt to compare the generated error 
> string to "Completed framework attempted to re-register" which changed in 
> mesos 0.24 to "Framework has been removed". These frameworks are now broken 
> with respect to this aspect of their error handling, at least until they're 
> changed to check for the new error string.
> Arguably frameworks shouldn't be comparing error strings since they're not 
> guaranteed to remain stable across releases. However, mesos currently offers 
> no alternative since there's no error **code** in the API.
> Furthermore, with the rise of the HTTP API there's room for two classes of 
> errors: synchronous validation errors vs. asynchronous errors. It would be 
> ideal to have meaningful 4xx error code responses for synchronous errors as 
> well as error codes for asynchronous errors delivered via ERROR events. These 
> error codes would become part of a stable API that mesos would treat just 
> like the rest of its APIs, allowing for deprecation cycles before breaking 
> changes - or at the very least a release note indicating an immediate 
> breaking change.
> /cc [~vinodkone], [~bmahler]





[jira] [Created] (MESOS-4565) slave recovers and attempts to destroy executor's child containers, then begins rejecting task status updates

2016-01-29 Thread James DeFelice (JIRA)
James DeFelice created MESOS-4565:
-

 Summary: slave recovers and attempts to destroy executor's child 
containers, then begins rejecting task status updates
 Key: MESOS-4565
 URL: https://issues.apache.org/jira/browse/MESOS-4565
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.26.0
Reporter: James DeFelice


AFAICT the slave is doing this:

1) recovering from some kind of failure
2) checking the containers that it pulled from its state store
3) complaining about cgroup children hanging off of executor containers
4) rejecting task status updates related to the executor container, the first 
of which in the logs is:

{code}
E0130 02:22:21.979852 12683 slave.cpp:2963] Failed to update resources for 
container 1d965a20-849c-40d8-9446-27cb723220a9 of executor 
'd701ab48a0c0f13_k8sm-executor' running task 
pod.f2dc2c43-c6f7-11e5-ad28-0ad18c5e6c7f on status update for terminal task, 
destroying container: Container '1d965a20-849c-40d8-9446-27cb723220a9' not found
{code}

To be fair, I don't believe that my custom executor is re-registering properly 
with the slave prior to attempting to send these (failing) status updates. But 
the slave doesn't complain about that; it complains that it can't find the 
**container**.

slave log here:
https://gist.github.com/jdef/265663461156b7a7ed4e





[jira] [Commented] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.

2016-02-01 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127272#comment-15127272
 ] 

James DeFelice commented on MESOS-416:
--

AKA oom_score_adj?

> Ensure master / slave do not get kernel OOM before executors, by setting 
> oom_adj control.
> -
>
> Key: MESOS-416
> URL: https://issues.apache.org/jira/browse/MESOS-416
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: twitter
>
> We can adjust the /proc/<pid>/oom_adj control during master / slave startup, 
> setting it to a low value to ensure we aren't killed first during an OOM.
> Relevant LWN article: http://lwn.net/Articles/317814/
> Also relevant: https://bugzilla.redhat.com/show_bug.cgi?id=239313
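For context, the mechanism is a one-line write to procfs. A minimal sketch (modern kernels use {{oom_score_adj}}, range -1000..1000, instead of the older {{oom_adj}}; the helper names here are illustrative, not Mesos code):

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Sketch: adjust the kernel OOM-kill priority of a process by writing
// to /proc/<pid>/oom_score_adj. Lowering the score (protecting the
// process) needs CAP_SYS_RESOURCE; raising it is always permitted.
bool setOomScoreAdj(const std::string& pid, int score)
{
  std::ofstream f("/proc/" + pid + "/oom_score_adj");
  if (!f.is_open()) {
    return false;
  }
  f << score;
  f.flush();
  return f.good();
}

// Read back the current score (returns 0 if the file is unreadable).
int readOomScoreAdj(const std::string& pid)
{
  std::ifstream f("/proc/" + pid + "/oom_score_adj");
  int score = 0;
  f >> score;
  return score;
}
```

A master or agent would call something like {{setOomScoreAdj("self", -999)}} early in startup, while executors keep the default of 0 and so get OOM-killed first.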



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.

2016-02-01 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-416:
-
Labels: mesosphere security twitter  (was: mesosphere twitter)

> Ensure master / slave do not get kernel OOM before executors, by setting 
> oom_adj control.
> -
>
> Key: MESOS-416
> URL: https://issues.apache.org/jira/browse/MESOS-416
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: mesosphere, security, twitter
>
> We can adjust the /proc/<pid>/oom_adj control during master / slave startup, 
> setting it to a low value to ensure we aren't killed first during an OOM.
> Relevant LWN article: http://lwn.net/Articles/317814/
> Also relevant: https://bugzilla.redhat.com/show_bug.cgi?id=239313



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-416) Ensure master / slave do not get kernel OOM before executors, by setting oom_adj control.

2016-02-01 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15127282#comment-15127282
 ] 

James DeFelice commented on MESOS-416:
--

FWIW kubernetes is doing this already for its important procs

> Ensure master / slave do not get kernel OOM before executors, by setting 
> oom_adj control.
> -
>
> Key: MESOS-416
> URL: https://issues.apache.org/jira/browse/MESOS-416
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: mesosphere, security, twitter
>
> We can adjust the /proc/<pid>/oom_adj control during master / slave startup, 
> setting it to a low value to ensure we aren't killed first during an OOM.
> Relevant LWN article: http://lwn.net/Articles/317814/
> Also relevant: https://bugzilla.redhat.com/show_bug.cgi?id=239313



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4565) slave recovers and attempts to destroy executor's child containers, then begins rejecting task status updates

2016-02-03 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131778#comment-15131778
 ] 

James DeFelice commented on MESOS-4565:
---

To be clear, the custom executor in this case is using the native
containerizer, not the Docker one.



> slave recovers and attempts to destroy executor's child containers, then 
> begins rejecting task status updates
> 
>
> Key: MESOS-4565
> URL: https://issues.apache.org/jira/browse/MESOS-4565
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.26.0
>Reporter: James DeFelice
>  Labels: mesosphere
>
> AFAICT the slave is doing this:
> 1) recovering from some kind of failure
> 2) checking the containers that it pulled from its state store
> 3) complaining about cgroup children hanging off of executor containers
> 4) rejecting task status updates related to the executor container, the first 
> of which in the logs is:
> {code}
> E0130 02:22:21.979852 12683 slave.cpp:2963] Failed to update resources for 
> container 1d965a20-849c-40d8-9446-27cb723220a9 of executor 
> 'd701ab48a0c0f13_k8sm-executor' running task 
> pod.f2dc2c43-c6f7-11e5-ad28-0ad18c5e6c7f on status update for terminal task, 
> destroying container: Container '1d965a20-849c-40d8-9446-27cb723220a9' not 
> found
> {code}
> To be fair, I don't believe that my custom executor is re-registering 
> properly with the slave prior to attempting to send these (failing) status 
> updates. But the slave doesn't complain about that... it complains that it 
> can't find the **container**.
> slave log here:
> https://gist.github.com/jdef/265663461156b7a7ed4e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.

2017-05-25 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025484#comment-16025484
 ] 

James DeFelice commented on MESOS-7492:
---

aren't {{poll_interval}} and {{initial_delay}} baked into {{CheckInfo}} already?

> Introduce a daemon manager in the agent.
> 
>
> Key: MESOS-7492
> URL: https://issues.apache.org/jira/browse/MESOS-7492
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>
> Once we have standalone container support from the containerizer, we should 
> consider adding a daemon manager inside the agent. It'll be like 'monit', 
> 'upstart' or 'systemd', but with very limited functionalities. For instance, 
> as a start, the manager will simply always restart the daemons if the daemon 
> fails. It'll also try to cleanup unknown daemons.
> This feature will be used to manage CSI plugin containers on the agent.
> The daemon manager should have an interface allowing operators to "register" 
> a daemon with a name and a config of the daemon. The daemon manager is 
> responsible for restarting the daemon if it crashes until someone explicitly 
> "unregister" it. Some simple backoff and health check functionality should be 
> provided.
> We probably need a small design doc for this.
> {code}
> message DaemonConfig {
>   optional ContainerInfo container;
>   optional CommandInfo command;
>   optional uint32 poll_interval;
>   optional uint32 initial_delay;
>   optional CheckInfo check; // For health check.
> }
>
> class DaemonManager
> {
> public:
>   Future<Nothing> register(
>     const ContainerID& containerId,
>     const DaemonConfig& config);
>
>   Future<Nothing> unregister(const ContainerID& containerId);
>
>   Future<vector<ContainerID>> ps();
>
>   Future<DaemonStatus> status(const ContainerID& containerId);
> };
> {code}
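The always-restart behavior described in the ticket amounts to a supervision loop with backoff. A hedged illustration of that idea (the names, constants, and callbacks below are mine, not from the proposal):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

using Duration = std::chrono::milliseconds;

// Double the backoff on each failed run, capped at `max`.
Duration nextBackoff(Duration current, Duration max)
{
  return std::min(current * 2, max);
}

// Minimal supervision loop: keep (re)starting `runOnce` until
// `shouldStop` says otherwise, backing off after failed runs.
void supervise(
    const std::function<bool()>& runOnce,
    const std::function<bool()>& shouldStop)
{
  Duration backoff{100};
  const Duration maxBackoff{30000};

  while (!shouldStop()) {
    if (runOnce()) {
      backoff = Duration{100};  // Clean exit: reset backoff, restart.
      continue;
    }
    std::this_thread::sleep_for(backoff);  // Crash: back off first.
    backoff = nextBackoff(backoff, maxBackoff);
  }
}
```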



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking

2017-06-02 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7605:
-

 Summary: UCR doesn't isolate uts namespace w/ host networking
 Key: MESOS-7605
 URL: https://issues.apache.org/jira/browse/MESOS-7605
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: James DeFelice


Docker's {{run}} command supports a {{--hostname}} parameter which impacts 
container isolation, even in {{host}} network mode: (via 
https://docs.docker.com/engine/reference/run/)
{quote}
Even in host network mode a container has its own UTS namespace by default. As 
such --hostname is allowed in host network mode and will only change the 
hostname inside the container. Similar to --hostname, the --add-host, --dns, 
--dns-search, and --dns-option options can be used in host network mode.
{quote}
I see no evidence that UCR offers a similar isolation capability.

Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was 
initially added to support the Docker containerizer's use of the {{--hostname}} 
Docker {{run}} flag.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6556) Hostname support for the network/cni isolator.

2017-06-02 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034991#comment-16034991
 ] 

James DeFelice commented on MESOS-6556:
---

{{hostname}} is only applied when there are container networks present. When 
using host-mode networking, the UTS namespace is not isolated and {{hostname}} 
is not applied to the container. Tracking via 
https://issues.apache.org/jira/browse/MESOS-7605

> Hostname support for the network/cni isolator.
> --
>
> Key: MESOS-6556
> URL: https://issues.apache.org/jira/browse/MESOS-6556
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.2.0
>
>
> -Add a {{namespace/uts}} isolator for doing UTS namespace isolation without 
> using the CNI isolator.-
> Update the {{network/cni}} isolator to set the hostname specified by the task 
> info.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7620) GET_VOLUMES call referenced in API docs, but the call doesn't exist

2017-06-05 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7620:
-

 Summary: GET_VOLUMES call referenced in API docs, but the call 
doesn't exist
 Key: MESOS-7620
 URL: https://issues.apache.org/jira/browse/MESOS-7620
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


https://github.com/apache/mesos/blob/d624255394b864ed477838e32f9712d7e63fc86f/include/mesos/v1/master/master.proto#L150

{code}
  // Create persistent volumes on reserved resources. The request is forwarded
  // asynchronously to the Mesos agent where the reserved resources are located.
  // That asynchronous message may not be delivered or creating the volumes at
  // the agent might fail. Volume creation can be verified by sending a
  // `GET_VOLUMES` call.
{code}

It's either a documentation bug, or a missing/overlooked feature.

/cc [~vinodkone] [~jieyu]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7314) Add offer operations for converting disk resources

2017-06-08 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043082#comment-16043082
 ] 

James DeFelice commented on MESOS-7314:
---

Does this preserve reservations across conversion ops? If not, it probably 
should...

> Add offer operations for converting disk resources
> --
>
> Key: MESOS-7314
> URL: https://issues.apache.org/jira/browse/MESOS-7314
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> One should be able to convert {{RAW}} and {{BLOCK}} disk resources into a 
> different types by applying operations to them. The offer operations and the 
> related validation and resource handling needs to be implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions

2017-06-20 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7697:
-

 Summary: Mesos scheduler v1 HTTP API may generate 404 errors for 
temporary conditions
 Key: MESOS-7697
 URL: https://issues.apache.org/jira/browse/MESOS-7697
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


Returning a 404 error for a known temporary condition is confusing from a 
client's perspective: a client wants to know how to recover from each error 
condition. A 404 should be distinct from a "server is not yet ready, but will 
be shortly" condition (which should probably be reported as a 503 "Service 
Unavailable" error).

https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593

{code}
if (response->code == process::http::Status::NOT_FOUND) {
  // This could happen if the master libprocess process has not yet set up
  // HTTP routes.
  LOG(WARNING) << "Received '" << response->status << "' ("
   << response->body << ") for " << call.type();
  return;
}
{code}
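From the client side, the distinction argued for here boils down to classifying status codes before giving up. An illustrative helper (my sketch, not part of the Mesos scheduler library):

```cpp
#include <cassert>

// Illustrative client-side policy: which HTTP status codes should a
// scheduler treat as temporary conditions and retry after a delay?
bool isRetryable(int statusCode)
{
  switch (statusCode) {
    case 404:  // May only mean "HTTP routes not yet set up" per this report.
    case 429:  // Too Many Requests: back off and retry.
    case 503:  // Service Unavailable: explicitly "try again later".
      return true;
    default:
      return false;  // e.g. 400/403 are permanent client errors.
  }
}
```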



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6638) Update Suppress and Revive to be per-role.

2017-06-26 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063092#comment-16063092
 ] 

James DeFelice commented on MESOS-6638:
---

https://github.com/mesos/mesos-go/pull/304#issuecomment-311060962

SUPPRESS and REVIVE were already members of the scheduler Call.Type prior to 
1.2; now that they have corresponding message object types, it's unclear if the 
message objects are actually required if there's no role to set. The docs are 
unclear on this point.

> Update Suppress and Revive to be per-role.
> --
>
> Key: MESOS-6638
> URL: https://issues.apache.org/jira/browse/MESOS-6638
> Project: Mesos
>  Issue Type: Task
>  Components: scheduler api
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
> Fix For: 1.2.0
>
>
> The {{SUPPRESS}} and {{REVIVE}} calls need to be updated to be per-role. I.e. 
> Include {{Revive.role}} and {{Suppress.role}} fields, indicating which role 
> the operation is being applied to.
> {{Revive}} and {{Suppress}} messages do not currently exist, so these need to 
> be added. To support the old-style schedulers, we will make the role fields 
> optional.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7211) Document SUPPRESS HTTP call

2017-06-26 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-7211:
--
Labels: mesosphere newbie  (was: newbie)

> Document SUPPRESS HTTP call
> ---
>
> Key: MESOS-7211
> URL: https://issues.apache.org/jira/browse/MESOS-7211
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 1.1.0
>Reporter: Bruce Merry
>Priority: Minor
>  Labels: mesosphere, newbie
>
> The documentation at 
> http://mesos.apache.org/documentation/latest/scheduler-http-api/ doesn't list 
> the SUPPRESS call as one of the call types, but it does seem to be 
> implemented.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7734) More consistent/strict validation of role names please

2017-06-28 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7734:
-

 Summary: More consistent/strict validation of role names please
 Key: MESOS-7734
 URL: https://issues.apache.org/jira/browse/MESOS-7734
 Project: Mesos
  Issue Type: Improvement
Reporter: James DeFelice


As per the currently implemented role validation rules:
https://github.com/apache/mesos/blob/63e08146aa7aa8efac3928922b6cdef92aa1d2ce/src/common/roles.cpp#L71
 

... the following role names are allowed:
{code}
eng-
eng*
*eng
eng.
eng..
...
..eng
{code}

The `/` character has good validation semantics around it and it's a much less 
confusing character to use when composing hierarchical role names. The `*`, 
`.`, and `-` characters have specific validation rules within a narrow context, 
but it's too easy for someone to compose confusing role names using these 
characters.

IMO, validation should severely restrict the context in which the `*` and `.` 
characters are used, as well as implement a symmetrical "endWith" check for `-` 
characters.
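A stricter validator along these lines might look as follows; the rules encoded here are my reading of the proposal, not Mesos's implemented behavior:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Sketch of stricter role validation: '*' is allowed only as the
// standalone default role, and no hierarchical component may start
// or end with '-' or '.'.
bool isValidRole(const std::string& role)
{
  if (role.empty()) {
    return false;
  }
  if (role == "*") {
    return true;  // The default role is special-cased.
  }

  // Split on '/' and validate each hierarchical component.
  std::stringstream ss(role);
  std::string part;
  while (std::getline(ss, part, '/')) {
    if (part.empty() || part == "." || part == "..") {
      return false;
    }
    if (part.find('*') != std::string::npos) {
      return false;  // '*' never appears inside a role name.
    }
    if (part.front() == '-' || part.back() == '-' ||
        part.front() == '.' || part.back() == '.') {
      return false;
    }
  }
  return true;
}
```

Under these rules every confusing example above ({{eng-}}, {{eng*}}, {{*eng}}, {{eng.}}, {{..eng}}, {{...}}) is rejected, while {{eng/backend}} and {{*}} remain valid.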



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7675) Isolate network ports.

2017-06-29 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068462#comment-16068462
 ] 

James DeFelice commented on MESOS-7675:
---

Would this monitor only the network ports advertised as `ports` resources? 
Wondering about interaction with ephemeral ports.

> Isolate network ports.
> --
>
> Key: MESOS-7675
> URL: https://issues.apache.org/jira/browse/MESOS-7675
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> If a task uses network ports, there is no isolator that can enforce that it 
> only listens on the ports that it has resources for. Implement a ports 
> isolator that can limit tasks to listen only on allocated TCP ports.
> Roughly, the algorithm for this follows what standard tools like {{lsof}} and 
> {{ss}} do.
> * Find all the listening TCP sockets (using netlink)
> * Index the sockets by their node (from the netlink information)
> * Find all the open sockets on the system (by scanning {{/proc/\*/fd/\*}} 
> links)
> * For each open socket, check whether its node (given in the link target) in 
> the set of listen sockets that we scanned
> * If the socket is a listening socket and the corresponding PID is in the 
> task, send a resource limitation for the task
> Matching pids to tasks depends on using cgroup isolation, otherwise we would 
> have to build a full process tree, which would be nice to avoid.
> Scanning all the open sockets can be avoided by using the {{net_cls}} 
> isolator with kernel + libnl3 patches to publish the socket classid when we 
> find the listening socket.
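The scanning steps can be illustrated with the /proc/net/tcp view of the same data: listening sockets carry state 0A, and the inode column is what would be matched against /proc/*/fd/* link targets. A sketch of that bookkeeping (the proposal itself uses netlink, and these names are mine):

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

struct ListenSocket
{
  uint16_t port;
  uint64_t inode;
};

// Parse one /proc/net/tcp entry; returns true only for listeners.
// Columns: sl local_address rem_address st tx:rx tr:tm->when
//          retrnsmt uid timeout inode ...
bool parseListener(const std::string& line, ListenSocket* out)
{
  std::istringstream in(line);
  std::string sl, local, remote, state, queues, timers, retrnsmt;
  uint64_t uid, timeout, inode;
  if (!(in >> sl >> local >> remote >> state >> queues >> timers
           >> retrnsmt >> uid >> timeout >> inode)) {
    return false;
  }
  if (state != "0A") {
    return false;  // 0A == TCP_LISTEN.
  }

  // `local` is "hexip:hexport", e.g. "00000000:1F90" for 0.0.0.0:8080.
  const size_t colon = local.rfind(':');
  out->port = static_cast<uint16_t>(
      std::stoul(local.substr(colon + 1), nullptr, 16));
  out->inode = inode;
  return true;
}
```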



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7752) Command executor still active after terminal task state update.

2017-07-11 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-7752:
--
Labels: mesosphere  (was: )

> Command executor still active after terminal task state update.
> ---
>
> Key: MESOS-7752
> URL: https://issues.apache.org/jira/browse/MESOS-7752
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.0
>Reporter: A. Dukhovniy
>  Labels: mesosphere
>
> Here is a rather simple scenario to reproduce this error:
> * Frameworks starts a task with taskId = _task1_
> * Framework kills _task1_ *successfully* and *acknowledges* TASK_KILLED
> * Framework starts another task with the same _task1_  but receives 
> "_TASK_FAILED (Attempted to run multiple tasks using a "command" executor)_"
> *Note*: this test is racy so this scenario fails occasionally.
> *Here is a full log* from that show a life-cycle of a task id 
> _app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c_:
> {code:java}
> # Starting...
> WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 
> 10:51:14.476085 14666 master.cpp:3352] Authorizing framework principal 
> 'principal' to launch task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
> WARN [10:51:14 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 
> 10:51:14.510136 14666 master.cpp:4426] Launching task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
>  of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- (marathon) at 
> scheduler-6dbbac16-7355-4a33-aee6-b9697c83e51c@127.0.1.1:61567 with 
> resources...
> WARN [10:51:14 ResidentTaskIntegrationTest-MesosAgent-32788] I0703 
> 10:51:14.513908 14697 slave.cpp:2118] Queued task 
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
>  for executor 
> 'app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c'
>  of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 
> 10:51:15.011696 14671 master.cpp:6222] Forwarding status update TASK_RUNNING 
> (UUID: ed2d0475-9d83-4e09-9f54-5b4d323e4558) for task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
>  of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 
> 10:51:15.036391 14671 master.cpp:5092] Processing ACKNOWLEDGE call 
> ed2d0475-9d83-4e09-9f54-5b4d323e4558 for task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
>  of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- (marathon) at 
> scheduler-6dbbac16-7355-4a33-aee6-b9697c83e51c@127.0.1.1:61567 on agent 
> 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0
> {code}
> {code:java}
> # Killing...
> DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] WARN 
> [10:51:15 KillAction$] Killing known task 
> [app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c]
>  of instance instance 
> [app-restart-resident-app-with-five-instances.marathon-8882bd16-5fdd-11e7-a00e-0242aceef95c]
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703 
> 10:51:15.196702 14697 slave.cpp:3816] Handling status update TASK_KILLED 
> (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
>  of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- from 
> executor(1)@172.16.10.121:35184
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosAgent-32788] I0703 
> 10:51:15.197676 14697 slave.cpp:4166] Sending acknowledgement for status 
> update TASK_KILLED (UUID: f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
>  of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- to 
> executor(1)@172.16.10.121:35184
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 
> 10:51:15.198299 14671 master.cpp:6154] Status update TASK_KILLED (UUID: 
> f7e9d0bc-726c-43aa-9ddc-3b082a68642e) for task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c
>  of framework 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be- from agent 
> 76d8f3e7-8f3a-4764-bb7d-2bcf8e85e2be-S0 at slave(1)@172.16.10.121:32788 
> (172.16.10.121)
> DEBUG[10:51:15 ResidentTaskIntegrationTest-LocalMarathon-32800] INFO 
> [10:51:15 MarathonScheduler] Received status update for task 
> app-restart-resident-app-with-five-instances.8882bd16-5fdd-11e7-a00e-0242aceef95c:
>  TASK_KILLED (Command terminated with signal Terminated)
> WARN [10:51:15 ResidentTaskIntegrationTest-MesosMaster-32782] I0703 
> 10:51:15.216081 14671 master.cp

[jira] [Updated] (MESOS-7171) Mesos Containerizer Change Size of SHM

2017-07-11 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-7171:
--
Labels: mesosphere  (was: )

> Mesos Containerizer Change Size of SHM
> --
>
> Key: MESOS-7171
> URL: https://issues.apache.org/jira/browse/MESOS-7171
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Miguel Bernadin
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: mesosphere
>
> I would like the ability to adjust the size of the shared memory device, just 
> like this can be done with Docker.
> For example, with Docker you can specify how much space to allocate as a 
> parameter in the app definition in Marathon:
> {code}
>   "parameters": [
> {
>   "key": "shm-size",
>   "value": "256mb"
> }
> {code}
> As you can see below, here is an example of a container running and how much 
> space is available on disk reflecting this change.
> Modified Parameter Container:
> {code}
> {
>   "id": "/ubuntu-withshm",
>   "cmd": "sleep 1000\n",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "ubuntu",
>   "network": "HOST",
>   "privileged": false,
>   "parameters": [
> {
>   "key": "shm-size",
>   "value": "256mb"
> }
>   ],
>   "forcePullImage": false
> }
>   },
>   "portDefinitions": [
> {
>   "port": 10005,
>   "protocol": "tcp",
>   "labels": {}
> }
>   ]
> }
> {code}
> Modified Parameter Container:
> {code}
> core@ip-10-0-0-19 ~ $ docker exec -it a818cf2277a5 bash
> root@ip-10-0-0-19:/# df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  37G  2.0G   33G   6% /
> tmpfs   7.4G 0  7.4G   0% /dev
> tmpfs   7.4G 0  7.4G   0% /sys/fs/cgroup
> /dev/xvdb37G  2.0G   33G   6% /etc/hostname
> shm 256M 0  256M   0% /dev/shm
> {code}
> Standard Container:
> {code}
> {
>   "id": "/ubuntu-withoutshm",
>   "cmd": "sleep 1",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "ubuntu",
>   "network": "HOST",
>   "privileged": false,
>   "parameters": [],
>   "forcePullImage": false
> }
>   },
>   "portDefinitions": [
> {
>   "port": 10006,
>   "protocol": "tcp",
>   "labels": {}
> }
>   ]
> }
> {code}
> Standard Container:
> {code}
> root@ip-10-0-0-19:/# exit
> exit
> core@ip-10-0-0-19 ~ $ docker exec -it c85433062e78 bash
> root@ip-10-0-0-19:/# df -h
> Filesystem  Size  Used Avail Use% Mounted on
> overlay  37G  2.0G   33G   6% /
> tmpfs   7.4G 0  7.4G   0% /dev
> tmpfs   7.4G 0  7.4G   0% /sys/fs/cgroup
> /dev/xvdb37G  2.0G   33G   6% /etc/hostname
> shm  64M 0   64M   0% /dev/shm
> {code}
> How can this be done on mesos containerizer?
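For reference, Docker-style size strings such as {{256mb}} decode to bytes as sketched below (a hypothetical parser written for illustration, not an existing Mesos or Marathon flag):

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Parse a Docker-style size string such as "256mb", "64m", or "1g"
// into bytes. Returns 0 on malformed input (a sketch; real parsing
// should report errors properly).
uint64_t parseShmSize(const std::string& s)
{
  size_t i = 0;
  while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
    i++;
  }
  if (i == 0) {
    return 0;  // No leading digits.
  }
  const uint64_t value = std::stoull(s.substr(0, i));

  std::string unit;
  for (; i < s.size(); i++) {
    unit += static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
  }
  if (unit.empty() || unit == "b") return value;
  if (unit == "k" || unit == "kb") return value << 10;
  if (unit == "m" || unit == "mb") return value << 20;
  if (unit == "g" || unit == "gb") return value << 30;
  return 0;  // Unknown unit.
}
```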



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7605) UCR doesn't isolate uts namespace w/ host networking

2017-07-11 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082706#comment-16082706
 ] 

James DeFelice commented on MESOS-7605:
---

Re-opening this ticket for further discussion.

If there are no container networks, there is no UTS namespace isolation, as per:

https://github.com/apache/mesos/blob/9b69c09310cdb6d7cfca1284f60c3f1b422c77cc/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L655

Without such isolation, calls to `sethostname` from a container will change the 
hostname in the host's UTS namespace, as per:

https://linux.die.net/man/2/sethostname

and

https://linux.die.net/man/1/unshare

{quote}
UTS namespace

setting hostname, domainname will not affect rest of the system (CLONE_NEWUTS 
flag),
{quote}

This is distinctly different from the Docker experience. It also implies that 
it's impossible to give a container permission to **bind** to a host network 
port without also giving it permission to **change the host's hostname**. This 
feels like a security hole to me.
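The missing isolation can be sketched directly with the syscalls involved: after {{unshare(CLONE_NEWUTS)}}, a {{sethostname}} call is confined to the caller's fresh namespace, while without that step it renames the host. Illustrative only (both calls require CAP_SYS_ADMIN, and the helper name is mine):

```cpp
#include <sched.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <string>

// Enter a fresh UTS namespace, then rename only that namespace.
// Without the unshare() step, sethostname() would rename the host.
// Requires CAP_SYS_ADMIN; returns false (host untouched) otherwise.
bool isolateHostname(const std::string& name)
{
  if (unshare(CLONE_NEWUTS) != 0) {
    return false;  // Typically EPERM without CAP_SYS_ADMIN.
  }
  return sethostname(name.c_str(), name.size()) == 0;
}
```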

> UCR doesn't isolate uts namespace w/ host networking
> 
>
> Key: MESOS-7605
> URL: https://issues.apache.org/jira/browse/MESOS-7605
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James DeFelice
>  Labels: mesosphere
>
> Docker's {{run}} command supports a {{--hostname}} parameter which impacts 
> container isolation, even in {{host}} network mode: (via 
> https://docs.docker.com/engine/reference/run/)
> {quote}
> Even in host network mode a container has its own UTS namespace by default. 
> As such --hostname is allowed in host network mode and will only change the 
> hostname inside the container. Similar to --hostname, the --add-host, --dns, 
> --dns-search, and --dns-option options can be used in host network mode.
> {quote}
> I see no evidence that UCR offers a similar isolation capability.
> Related: the {{ContainerInfo}} protobuf has a {{hostname}} field which was 
> initially added to support the Docker containerizer's use of the 
> {{--hostname}} Docker {{run}} flag.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7839) io switchboard: clarify expected behavior when using TTYInfo with the default executor of a TaskGroup

2017-07-27 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7839:
-

 Summary: io switchboard: clarify expected behavior when using 
TTYInfo with the default executor of a TaskGroup
 Key: MESOS-7839
 URL: https://issues.apache.org/jira/browse/MESOS-7839
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: James DeFelice


I executed a LaunchGroup operation with an Executor of a DEFAULT type and with 
TTYInfo set to a non-empty protobuf. The tasks of the group did not specify a 
ContainerInfo.

Mesos "successfully" launched the task group and in the executor sandbox stderr 
reported
{code}
The io switchboard server failed: Failed redirecting stdout: Input/output error
{code}

... which seems relatively uninformative. Mesos also returned TASK_RUNNING 
followed by TASK_FINISHED for the tasks in the launched group. This wasn't what 
I expected: my goal was to launch a pod and have a TTY attached to the first 
task in the group.

After discussing with [~klueska] the solution to my problem was to specify 
TTYInfo for the container of the task within the group, not on the group's 
executor. But we agreed that Mesos could probably exhibit better behavior in 
the initial scenario that I tested.

Some (mutually exclusive) possibilities for alternate Mesos behavior:

(a) fail-fast: using the Default Executor with a task group doesn't support 
TTYInfo so Mesos should just refuse to launch the task group (and return an 
appropriate error code and message w/ a reasonable explanation).

(b) support TTYInfo when using the Default Executor with a task group. The use 
case for this is unclear.

(c) when using TTYInfo with the DefaultExecutor and task group, attach the TTY 
to the first task in the group.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.

2017-07-28 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104551#comment-16104551
 ] 

James DeFelice commented on MESOS-7492:
---

Could we start with an even more minimal design and (a) get rid of the health 
check fields (poll_interval, initial_delay, and check), and (b) eliminate the 
auto-restart feature? We can add these later if/when needed as requirements and 
user stories evolve.

There's lots of supervision tooling to choose from already and it's not clear 
to me that Mesos should spend the time reinventing this wheel right now. Also, 
supporting run-once daemon tasks actually supports **both** run-once and 
run-forever models (run-forever tasks just need **some** supervisor process 
above the actual service -- that supervisor doesn't need to be Mesos).

> Introduce a daemon manager in the agent.
> 
>
> Key: MESOS-7492
> URL: https://issues.apache.org/jira/browse/MESOS-7492
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
>
> Once we have standalone container support from the containerizer, we should 
> consider adding a daemon manager inside the agent. It'll be like 'monit', 
> 'upstart' or 'systemd', but with very limited functionalities. For instance, 
> as a start, the manager will simply always restart the daemons if the daemon 
> fails. It'll also try to cleanup unknown daemons.
> This feature will be used to manage CSI plugin containers on the agent.
> The daemon manager should have an interface allowing operators to "register" 
> a daemon with a name and a config of the daemon. The daemon manager is 
> responsible for restarting the daemon if it crashes until some one explicitly 
> "unregister" it. Some simple backoff and health check functionality should be 
> provided.
> We probably need a small design doc for this.
> {code}
> message DaemonConfig {
>   optional ContainerInfo container;
>   optional CommandInfo command;
>   optional uint32 poll_interval;
>   optional uint32 initial_delay;
>   optional CheckInfo check; // For health check.
> }
> class DaemonManager
> {
> public:
>   Future register(
> const ContainerID& containerId,
> const DaemonConfig& config;
>   Future unregister(const ContainerID& containerId);
>   Future> ps();
>   Future status(const ContainerID& containerId);
> };
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.

2017-07-28 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104557#comment-16104557
 ] 

James DeFelice commented on MESOS-7492:
---

Instead of health checks/auto-restart I'd actually like to see a way to adjust 
the "kill" signal that an agent will send to a daemon in order to shut it down. 
Especially if we want to support containerizing the various supervision systems 
that already exist in the wild (s6, systemd, etc).

> Introduce a daemon manager in the agent.
> 
>
> Key: MESOS-7492
> URL: https://issues.apache.org/jira/browse/MESOS-7492
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
>
> Once we have standalone container support from the containerizer, we should 
> consider adding a daemon manager inside the agent. It'll be like 'monit', 
> 'upstart' or 'systemd', but with very limited functionalities. For instance, 
> as a start, the manager will simply always restart the daemons if the daemon 
> fails. It'll also try to cleanup unknown daemons.
> This feature will be used to manage CSI plugin containers on the agent.
> The daemon manager should have an interface allowing operators to "register" 
> a daemon with a name and a config of the daemon. The daemon manager is 
> responsible for restarting the daemon if it crashes until some one explicitly 
> "unregister" it. Some simple backoff and health check functionality should be 
> provided.
> We probably need a small design doc for this.
> {code}
> message DaemonConfig {
>   optional ContainerInfo container;
>   optional CommandInfo command;
>   optional uint32 poll_interval;
>   optional uint32 initial_delay;
>   optional CheckInfo check; // For health check.
> }
> class DaemonManager
> {
> public:
>   Future<Nothing> register(
>     const ContainerID& containerId,
>     const DaemonConfig& config);
>   Future<Nothing> unregister(const ContainerID& containerId);
>   Future<vector<ContainerID>> ps();
>   Future status(const ContainerID& containerId);
> };
> {code}





[jira] [Created] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2017-09-13 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7974:
-

 Summary: Accept "application/recordio" type is rejected for master 
operator API SUBSCRIBE call
 Key: MESOS-7974
 URL: https://issues.apache.org/jira/browse/MESOS-7974
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: James DeFelice


The agent operator API supports "application/recordio" for things like 
attach-container-output, which streams objects back to the caller. I expected 
the master operator API SUBSCRIBE call to work the same way, w/ 
Accept/Content-Type headers for "recordio" and 
Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
not the case.

Looking again at the master operator API documentation, SUBSCRIBE docs 
illustrate usage of Accept and Content-Type headers for the "application/json" 
type, not a "recordio" type. So my experience, as per the docs, seems expected. 
However, this is counter-intuitive since the whole point of adding the new 
Message-prefixed headers was to help callers consistently request (and 
differentiate) streaming responses from non-streaming responses in the v1 API.

Please fix the master operator API implementation to also support the 
Message-prefixed headers w/ Accept/Content-Type set to "recordio".

Observed on ubuntu w/ mesos package version 1.2.1-2.0.1
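For clarity, this is the request shape I expected the master SUBSCRIBE call to accept, mirroring the agent API's streaming calls (hypothetical request; the host is a placeholder):

{code}
POST /api/v1 HTTP/1.1
Host: master.example.com:5050
Content-Type: application/json
Accept: application/recordio
Message-Accept: application/json

{"type": "SUBSCRIBE"}
{code}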





[jira] [Commented] (MESOS-7974) Accept "application/recordio" type is rejected for master operator API SUBSCRIBE call

2017-09-13 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164930#comment-16164930
 ] 

James DeFelice commented on MESOS-7974:
---

xref MESOS-6936

> Accept "application/recordio" type is rejected for master operator API 
> SUBSCRIBE call
> -
>
> Key: MESOS-7974
> URL: https://issues.apache.org/jira/browse/MESOS-7974
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: James DeFelice
>  Labels: mesosphere
>
> The agent operator API supports "application/recordio" for things like 
> attach-container-output, which streams objects back to the caller. I expected 
> the master operator API SUBSCRIBE call to work the same way, w/ 
> Accept/Content-Type headers for "recordio" and 
> Message-Accept/Message-Content-Type headers for json (or protobuf). This was 
> not the case.
> Looking again at the master operator API documentation, SUBSCRIBE docs 
> illustrate usage of Accept and Content-Type headers for the "application/json" 
> type, not a "recordio" type. So my experience, as per the docs, seems 
> expected. However, this is counter-intuitive since the whole point of adding 
> the new Message-prefixed headers was to help callers consistently request 
> (and differentiate) streaming responses from non-streaming responses in the 
> v1 API.
> Please fix the master operator API implementation to also support the 
> Message-prefixed headers w/ Accept/Content-Type set to "recordio".
> Observed on ubuntu w/ mesos package version 1.2.1-2.0.1





[jira] [Created] (MESOS-7976) v1/scheduler: Revive/Suppress role field should be marked experimental for 1.2.x branch

2017-09-13 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7976:
-

 Summary: v1/scheduler: Revive/Suppress role field should be marked 
experimental for 1.2.x branch
 Key: MESOS-7976
 URL: https://issues.apache.org/jira/browse/MESOS-7976
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.2, 1.2.1, 1.2.0
Reporter: James DeFelice
Assignee: Benjamin Mahler


The role field of the v1 scheduler API's Revive and Suppress call should have 
been marked as experimental since it was part of the experimental MULTI-ROLE 
feature. The field has been replaced in 1.3.x by a "repeated string roles" field.





[jira] [Commented] (MESOS-8060) Introduce first class 'profile' for disk resources.

2017-10-09 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197488#comment-16197488
 ] 

James DeFelice commented on MESOS-8060:
---

https://reviews.apache.org/r/62820/

> Introduce first class 'profile' for disk resources.
> ---
>
> Key: MESOS-8060
> URL: https://issues.apache.org/jira/browse/MESOS-8060
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> This is similar to storage classes. Instead of adding a bunch of storage 
> backend specific parameters (e.g., rotational, type, speed, etc.) into the 
> disk resources and asking the frameworks to make scheduling decisions based 
> on those vendor specific parameters, we propose to use a level of indirection 
> here.
> The operator will setup mappings between a profile name to a set of vendor 
> specific disk parameters. The framework will do disk selection based on 
> profile names.
> The storage resource provider will provide a hook allowing operators to 
> customize the profile name assignment for disk resources.





[jira] [Updated] (MESOS-8078) Some fields went missing with no replacement in api/v1

2017-10-20 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-8078:
--
Labels: mesosphere  (was: )

> Some fields went missing with no replacement in api/v1
> --
>
> Key: MESOS-8078
> URL: https://issues.apache.org/jira/browse/MESOS-8078
> Project: Mesos
>  Issue Type: Story
>  Components: HTTP API
>Reporter: Dmitrii Rozhkov
>  Labels: mesosphere
>
> Hi friends, 
> These fields are available via the state.json but went missing in the v1 of 
> the API:
> leader_info
> start_time
> elected_time
> As we're showing them on the Overview page of the DC/OS UI, yet would like 
> not be using state.json, it would be great to have them somewhere in V1.





[jira] [Created] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking

2017-11-03 Thread James DeFelice (JIRA)
James DeFelice created MESOS-8169:
-

 Summary: master validation incorrectly rejects slaves, buggy 
executorID checking
 Key: MESOS-8169
 URL: https://issues.apache.org/jira/browse/MESOS-8169
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: James DeFelice
Priority: Major


proposed fix: https://github.com/apache/mesos/pull/248

I observed this in my environment, where I had two frameworks that used the 
same ExecutorID and then triggered a master failover. The master refuses to 
reregister the slave because it's not considering the owning-framework of the 
ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) 
that there's an erroneous duplicate executor ID:

{code}
W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of agent 
at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: Executor 
has a duplicate ExecutorID 'default'
{code}

(yes, "default" is probably a terrible name for an ExecutorID - that's a 
separate discussion!)

/cc [~neilc]





[jira] [Commented] (MESOS-8169) master validation incorrectly rejects slaves, buggy executorID checking

2017-11-03 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16237738#comment-16237738
 ] 

James DeFelice commented on MESOS-8169:
---

/cc [~jamespeach]

> master validation incorrectly rejects slaves, buggy executorID checking
> ---
>
> Key: MESOS-8169
> URL: https://issues.apache.org/jira/browse/MESOS-8169
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> proposed fix: https://github.com/apache/mesos/pull/248
> I observed this in my environment, where I had two frameworks that used the 
> same ExecutorID and then triggered a master failover. The master refuses to 
> reregister the slave because it's not considering the owning-framework of the 
> ExecutorID when computing ExecutorID uniqueness, and concludes (incorrectly) 
> that there's an erroneous duplicate executor ID:
> {code}
> W1103 00:33:42.509891 19638 master.cpp:6008] Dropping re-registration of 
> agent at slave(1)@10.2.0.7:5051 because it sent an invalid re-registration: 
> Executor has a duplicate ExecutorID 'default'
> {code}
> (yes, "default" is probably a terrible name for an ExecutorID - that's a 
> separate discussion!)
> /cc [~neilc]





[jira] [Updated] (MESOS-8171) Using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop

2017-11-04 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-8171:
--
Labels: mesosphere  (was: )

> Using a failoverTimeout of 0 with Mesos native scheduler client can result in 
> infinite subscribe loop
> -
>
> Key: MESOS-8171
> URL: https://issues.apache.org/jira/browse/MESOS-8171
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, java api, scheduler driver
>Affects Versions: 1.1.3, 1.2.2, 1.3.1, 1.4.0
>Reporter: Tim Harper
>Priority: Minor
>  Labels: mesosphere
>
> Over the past year, the Marathon team has been plagued with an issue that 
> hits our CI builds periodically in which the scheduler driver enters a tight 
> loop, sending 10,000s of SUBSCRIBE calls to the master per second. I turned 
> on debug logging for the client and the server, and it pointed to an issue 
> with the {{doReliableRegistration}} method in sched.cpp. Here's the logs:
> {code}
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.099815 13397 process.cpp:1383] libprocess is initialized on 
> 127.0.1.1:60957 with 8 worker threads
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.118237 13397 logging.cpp:199] Logging to STDERR
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.128921 13416 sched.cpp:232] Version: 1.4.0
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.151785 13791 group.cpp:341] Group process 
> (zookeeper-group(1)@127.0.1.1:60957) connected to ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.151823 13791 group.cpp:831] Syncing group operations: queue size 
> (joins, cancels, datas) = (0, 0, 0)
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.151837 13791 group.cpp:419] Trying to create path '/mesos' in 
> ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.152586 13791 group.cpp:758] Found non-sequence node 'log_replicas' 
> at '/mesos' in ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.152662 13791 detector.cpp:152] Detected a new leader: (id='0')
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.152762 13791 group.cpp:700] Trying to get 
> '/mesos/json.info_00' in ZooKeeper
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157148 13791 zookeeper.cpp:262] A new leading master 
> (UPID=master@172.16.10.95:32856) is detected
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157347 13787 sched.cpp:336] New master detected at 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157557 13787 sched.cpp:352] No credentials provided. Attempting to 
> register without authentication
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157565 13787 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.157635 13787 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.158979 13785 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159029 13785 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159265 13790 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159303 13790 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159479 13786 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159521 13786 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159622 13788 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159658 13788 sched.cpp:869] Will retry registration in 0ns if 
> necessary
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159749 13789 sched.cpp:836] Sending SUBSCRIBE call to 
> master@172.16.10.95:32856
> WARN [05:39:39 EventsIntegrationTest-LocalMarathon-32858] I1104 
> 05:39:39.159785 13789 sched.cpp:869] Will ret

[jira] [Created] (MESOS-8237) Mesos injects Resource.allocation_info for all resource offers, regardless of MULTI_ROLE opt-in

2017-11-15 Thread James DeFelice (JIRA)
James DeFelice created MESOS-8237:
-

 Summary: Mesos injects Resource.allocation_info for all resource 
offers, regardless of MULTI_ROLE opt-in
 Key: MESOS-8237
 URL: https://issues.apache.org/jira/browse/MESOS-8237
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice
Assignee: Benjamin Mahler


In support of MULTI_ROLE capable frameworks, a Resource.allocation_info field 
was added and the Resource math of the Mesos library was updated to check for 
matching allocation_info when checking for (in)equality, addability, 
subtractability, containment, etc. To compensate for these changes, the demo 
frameworks of Mesos were updated to set the allocation_info for Resource 
objects during the "matching phase" in which offers' resources are evaluated in 
order for the framework to launch tasks. The Mesos demo frameworks NEEDED to be 
updated because the Resource algebra within Mesos now depended on matching 
allocation_info fields of Resource objects when executing algebraic operations. 
See 
https://github.com/apache/mesos/commit/c20744a9976b5e83698e9c6062218abb4d2e6b25#diff-298cc6a77862b7ff3422cd06c215ef28R91
 .

This poses a unique problem for **external** libraries that both aim to support 
various frameworks, some that DO and some that DO NOT opt-in to the MULTI_ROLE 
capability; specifically those external libraries that implement Resource 
algebra that's consistent with what Mesos implements internally. One such 
example of a library is mesos-go, though there are undoubtedly others. The 
problem can be explained via this scenario: 
{quote}
Flo's mesos-go framework is running well, it doesn't opt-in to MULTI_ROLE 
because it doesn't need multiple roles. His framework runs on a version of 
Mesos that existed prior to integration of MULTI_ROLE support. His DC operator 
upgrades the mesos cluster to the latest version. Flo rebuilds his framework on 
the latest version of mesos-go and re-launches it on the cluster. He observes 
that his framework receives offers, but rejects ALL of them. Digging into the 
code he realizes that Mesos is injecting allocation_info into Resource objects 
being offered to his framework, and mesos-go considers allocation_info when 
comparing Resource objects (because it's MULTI_ROLE compatible now), but his 
framework doesn't take this into consideration when preparing its own Resource 
objects prior to the "resource matching phase". The consequence is that Flo's 
framework is trying to match against Resources that will never align because 
his framework isn't setting an allocation_info that might possibly match the 
allocation_info that Mesos is always injecting - regardless of the MULTI_ROLE 
capability (or lack thereof in this case) of his framework.
{quote}

If Mesos were to strip the allocation_info from Resource objects, prior to 
offering them to non-multi-role frameworks, then the problem illustrated above 
would go away.






[jira] [Updated] (MESOS-3601) Formalize all headers and metadata for HTTP API Event Stream

2016-11-10 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-3601:
--
Labels: api http mesosphere wireprotocol  (was: api http wireprotocol)

> Formalize all headers and metadata for HTTP API Event Stream
> 
>
> Key: MESOS-3601
> URL: https://issues.apache.org/jira/browse/MESOS-3601
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.24.0
> Environment: Mesos 0.24.0
>Reporter: Ben Whitehead
>  Labels: api, http, mesosphere, wireprotocol
>
> From an HTTP standpoint the current set of headers returned when connecting 
> to the HTTP scheduler API are insufficient. 
> {code:title=current headers}
> HTTP/1.1 200 OK
> Transfer-Encoding: chunked
> Date: Wed, 30 Sep 2015 21:07:16 GMT
> Content-Type: application/json
> {code}
> Since the response from mesos is intended to function as a stream, 
> {{Connection: keep-alive}} should be specified so that the connection can 
> remain open.
> If RecordIO is going to be applied to the messages, the headers should 
> include the information necessary for a client to be able to detect RecordIO 
> and set up its response handlers appropriately.
> How RecordIO is expressed will come down to the semantics of what is actually 
> "Returned" as the response from {{POST /api/v1/scheduler}}.
> h4. Proposal
> One approach would be to leverage http as much as possible, having a client 
> specify an {{Accept-Encoding}} along with the {{Accept}} header to indicate 
> that it can handle RecordIO {{Content-Encoding}} of {{Content-Type}} 
> messages.  (This approach allows for things like gzip to be woven in fairly 
> easily in the future)
> For this approach I would expect the following:
> {code:title=Request}
> POST /api/v1/scheduler HTTP/1.1
> Host: localhost:5050
> Accept: application/x-protobuf
> Accept-Encoding: recordio
> Content-Type: application/x-protobuf
> Content-Length: 35
> User-Agent: RxNetty Client
> {code}
> {code:title=Response}
> HTTP/1.1 200 OK
> Connection: keep-alive
> Transfer-Encoding: chunked
> Content-Type: application/x-protobuf
> Content-Encoding: recordio
> Cache-Control: no-transform
> {code}
> When Content-Encoding is used it is recommended to set {{Cache-Control: 
> no-transform}} to signal to any proxies that no transformation should be 
> applied to the content encoding [Section 14.11 RFC 
> 2616|http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3601) Formalize all headers and metadata for HTTP API Event Stream

2016-11-10 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655531#comment-15655531
 ] 

James DeFelice commented on MESOS-3601:
---

what about something like this?
{code}
Content-type: application/json; streamFormat=record-io
{code}

(a) maintains backwards compat; and
(b) provides a way for clients to determine if it's a series/stream of json 
objects (and how they're packaged) vs a single object (would not include a 
streamFormat parameter here)

there's already the concept of a "stream" in the API (via the Mesos-Stream-Id 
header)

> Formalize all headers and metadata for HTTP API Event Stream
> 
>
> Key: MESOS-3601
> URL: https://issues.apache.org/jira/browse/MESOS-3601
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.24.0
> Environment: Mesos 0.24.0
>Reporter: Ben Whitehead
>  Labels: api, http, mesosphere, wireprotocol
>
> From an HTTP standpoint the current set of headers returned when connecting 
> to the HTTP scheduler API are insufficient. 
> {code:title=current headers}
> HTTP/1.1 200 OK
> Transfer-Encoding: chunked
> Date: Wed, 30 Sep 2015 21:07:16 GMT
> Content-Type: application/json
> {code}
> Since the response from mesos is intended to function as a stream, 
> {{Connection: keep-alive}} should be specified so that the connection can 
> remain open.
> If RecordIO is going to be applied to the messages, the headers should 
> include the information necessary for a client to be able to detect RecordIO 
> and set up its response handlers appropriately.
> How RecordIO is expressed will come down to the semantics of what is actually 
> "Returned" as the response from {{POST /api/v1/scheduler}}.
> h4. Proposal
> One approach would be to leverage http as much as possible, having a client 
> specify an {{Accept-Encoding}} along with the {{Accept}} header to indicate 
> that it can handle RecordIO {{Content-Encoding}} of {{Content-Type}} 
> messages.  (This approach allows for things like gzip to be woven in fairly 
> easily in the future)
> For this approach I would expect the following:
> {code:title=Request}
> POST /api/v1/scheduler HTTP/1.1
> Host: localhost:5050
> Accept: application/x-protobuf
> Accept-Encoding: recordio
> Content-Type: application/x-protobuf
> Content-Length: 35
> User-Agent: RxNetty Client
> {code}
> {code:title=Response}
> HTTP/1.1 200 OK
> Connection: keep-alive
> Transfer-Encoding: chunked
> Content-Type: application/x-protobuf
> Content-Encoding: recordio
> Cache-Control: no-transform
> {code}
> When Content-Encoding is used it is recommended to set {{Cache-Control: 
> no-transform}} to signal to any proxies that no transformation should be 
> applied to the content encoding [Section 14.11 RFC 
> 2616|http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.11].





[jira] [Created] (MESOS-6595) As a Mesos user I want to launch processes that will run on every node in the cluster

2016-11-15 Thread James DeFelice (JIRA)
James DeFelice created MESOS-6595:
-

 Summary: As a Mesos user I want to launch processes that will run 
on every node in the cluster
 Key: MESOS-6595
 URL: https://issues.apache.org/jira/browse/MESOS-6595
 Project: Mesos
  Issue Type: Story
Reporter: James DeFelice


Some applicable use cases:
- log collection
- metrics and monitoring
- service discovery

It might also be useful to break this functionality down into: daemon processes 
for master nodes vs. daemon processes for agent nodes.

There was some initial discussion and back-of-the-napkin design for this at 
Mesoscon this past year (with an emphasis on agent nodes) but I'm not aware 
that anything significant materialized from that.





[jira] [Commented] (MESOS-6595) As a Mesos user I want to launch processes that will run on every node in the cluster

2016-11-15 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15668408#comment-15668408
 ] 

James DeFelice commented on MESOS-6595:
---

Marathon users have been asking for this feature for .. years: 
https://github.com/mesosphere/marathon/issues/846

> As a Mesos user I want to launch processes that will run on every node in the 
> cluster
> -
>
> Key: MESOS-6595
> URL: https://issues.apache.org/jira/browse/MESOS-6595
> Project: Mesos
>  Issue Type: Story
>Reporter: James DeFelice
>  Labels: mesosphere
>
> Some applicable use cases:
> - log collection
> - metrics and monitoring
> - service discovery
> It might also be useful to break this functionality down into: daemon 
> processes for master nodes vs. daemon processes for agent nodes.
> There was some initial discussion and back-of-the-napkin design for this at 
> Mesoscon this past year (with an emphasis on agent nodes) but I'm not aware 
> that anything significant materialized from that.





[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable

2015-02-18 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325794#comment-14325794
 ] 

James DeFelice commented on MESOS-1571:
---

In the kubernetes-mesos framework, the executor Shutdown() implementation 
currently force-stops the containers it's managing (which, to my 
understanding, sends SIGKILL). It manages Docker containers, which are normally 
given 10s to shut down gracefully before Docker sends a SIGKILL. That 10s 
timeout is not compatible with the default slave flag 
`executor_shutdown_grace_period` value of mesos (3s). However if I change the 
value of that timeout to 20s to give the executor more time to gracefully kill 
things there's no way for the executor to reason about that because it has no 
idea of how much time it actually has.

As a workaround I've considered looking up the slave PID from the environment 
and querying its state.json for the startup flags, and trying to make a 
decision based on that. That approach seems somewhat hackish and I'd much 
rather do something nicer.

It would be great to have an environment var 
`MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD` or something, provided by the slave 
containerizer, so that the executor can make a decision about whether to send 
(via Docker) a TERM (and wait 10s) or KILL signal.

> Signal escalation timeout is not configurable
> -
>
> Key: MESOS-1571
> URL: https://issues.apache.org/jira/browse/MESOS-1571
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>Assignee: Alexander Rukletsov
>
> Even though the executor shutdown grace period is set to a larger interval, 
> the signal escalation timeout will still be 3 seconds. It should either be 
> configurable or dependent on EXECUTOR_SHUTDOWN_GRACE_PERIOD.
> Thoughts?





[jira] [Commented] (MESOS-2407) libprocess segfaults when using GLOG_v=2

2015-02-26 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339001#comment-14339001
 ] 

James DeFelice commented on MESOS-2407:
---

gmtime isn't thread-safe either; gmtime_r should be used instead 
(http://linux.die.net/man/3/gmtime_r).

Other projects have had the same problem, e.g. 
http://sourceforge.net/p/levent/bugs/_discuss/thread/465dad36/948e/attachment/0001-Avoid-use-of-non-threadsafe-locale-dependent-strftim.patch


> libprocess segfaults when using GLOG_v=2
> 
>
> Key: MESOS-2407
> URL: https://issues.apache.org/jira/browse/MESOS-2407
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Priority: Blocker
>
> Found this while debugging MESOS-2403. Looks like a thread safety issue with 
> stream operator in Process::resume().
> {code}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from MasterAllocatorTest/0, where TypeParam = 
> mesos::internal::master::allocator::MesosAllocator  mesos::internal::master::allocator::DRFSorter> >
> [ RUN  ] MasterAllocatorTest/0.FrameworkReregistersFirst
> I0226 00:48:28.159126 29931 process.cpp:2150] Spawned process 
> files@10.35.255.108:46621
> I0226 00:48:28.159194 29958 process.cpp:2160] Resuming 
> files@10.35.255.108:46621 at 2015-02-26 00:48:28.159184896+00:00
> I0226 00:48:28.159333 29958 process.cpp:2160] Resuming 
> help@10.35.255.108:46621 at 2015-02-26 00:48:28.159284992+00:00
> I0226 00:48:28.159343 29931 process.cpp:2150] Spawned process 
> hierarchical-allocator(25)@10.35.255.108:46621
> I0226 00:48:28.159418 29955 process.cpp:2160] Resuming 
> hierarchical-allocator(25)@10.35.255.108:46621 at 2015-02-26 
> 00:48:28.159364864+00:00
> Using temporary directory 
> '/tmp/MasterAllocatorTest_0_FrameworkReregistersFirst_J8P9UO'
> I0226 00:48:28.165838 29970 process.cpp:2117] Dropped / Lost event for PID: 
> hierarchical-allocator(22)@10.35.255.108:46621
> I0226 00:48:28.193131 29964 process.cpp:2160] Resuming 
> reaper(1)@10.35.255.108:46621 at 2015-02-26 00:48:28.193116928+00:00
> I0226 00:48:28.267730 29931 leveldb.cpp:176] Opened db in 107.694932ms
> I0226 00:48:28.281376 29931 leveldb.cpp:183] Compacted db in 13.598726ms
> I0226 00:48:28.281435 29931 leveldb.cpp:198] Created db iterator in 10363ns
> I0226 00:48:28.281461 29931 leveldb.cpp:204] Seeked to beginning of db in 
> 1180ns
> I0226 00:48:28.281491 29931 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 328ns
> I0226 00:48:28.281518 29931 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0226 00:48:28.281559 29931 process.cpp:2150] Spawned process 
> log-replica(25)@10.35.255.108:46621
> I0226 00:48:28.281648 29959 process.cpp:2160] Resuming 
> log-replica(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.281634048+00:00
> I0226 00:48:28.281716 29967 process.cpp:2160] Resuming 
> (257)@10.35.255.108:46621 at 2015-02-26 00:48:28.281709056+00:00
> I0226 00:48:28.281654 29931 process.cpp:2150] Spawned process 
> (257)@10.35.255.108:46621
> I0226 00:48:28.281837 29931 process.cpp:2150] Spawned process 
> log(25)@10.35.255.108:46621
> I0226 00:48:28.281843 29962 process.cpp:2160] Resuming 
> log(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.281829120+00:00
> I0226 00:48:28.282060 29931 process.cpp:2150] Spawned process 
> log-reader(25)@10.35.255.108:46621
> I0226 00:48:28.282080 29948 process.cpp:2160] Resuming 
> log-reader(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282066944+00:00
> I0226 00:48:28.282142 29962 process.cpp:2150] Spawned process 
> log-recover(25)@10.35.255.108:46621
> I0226 00:48:28.282181 29962 process.cpp:2160] Resuming 
> log-writer(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282177024+00:00
> I0226 00:48:28.282162 29952 process.cpp:2160] Resuming 
> __gc__@10.35.255.108:46621 at 2015-02-26 00:48:28.282151936+00:00
> I0226 00:48:28.282162 29958 process.cpp:2160] Resuming 
> log-recover(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282151936+00:00
> I0226 00:48:28.282378 29958 recover.cpp:449] Starting replica recovery
> I0226 00:48:28.282438 29954 process.cpp:2160] Resuming 
> log-replica(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282430976+00:00
> I0226 00:48:28.282512 29931 process.cpp:2150] Spawned process 
> log-writer(25)@10.35.255.108:46621
> I0226 00:48:28.282533 29968 process.cpp:2160] Resuming 
> log-recover(25)@10.35.255.108:46621 at 2015-02-26 00:48:28.282522880+00:00
> I0226 00:48:28.282591 29968 recover.cpp:475] Replica is in 4 status
> I0226 00:48:28.282618 29950 process.cpp:2160] Resuming 
> metrics@10.35.255.108:46621 at 2015-02-26 00:48:28.282608128+00:00
> I0226 00:48:28.282716 29963 process.cpp:2160] Resuming 
> (258)@10.35.255.108:46621 at 2015-02-26 00:48:28.282684928+00:00
> I0226 00:48:28.28

[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-03-11 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356934#comment-14356934
 ] 

James DeFelice commented on MESOS-2162:
---

FWIW the kubernetes team is also considering rocket support: 
https://github.com/GoogleCloudPlatform/kubernetes/issues/2725


> Consider a C++ implementation of CoreOS AppContainer spec
> -
>
> Key: MESOS-2162
> URL: https://issues.apache.org/jira/browse/MESOS-2162
> Project: Mesos
>  Issue Type: Story
>  Components: containerization
>Reporter: Dominic Hamon
>  Labels: mesosphere, twitter
>
> CoreOS have released a 
> [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
>  for a container abstraction as an alternative to Docker. They have also 
> released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
> We should consider a C++ implementation of the specification to have parity 
> with the community and then use this implementation for our containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2499) SOURCE_EXECUTOR not set properly in slave.cpp

2015-03-16 Thread James DeFelice (JIRA)
James DeFelice created MESOS-2499:
-

 Summary: SOURCE_EXECUTOR not set properly in slave.cpp
 Key: MESOS-2499
 URL: https://issues.apache.org/jira/browse/MESOS-2499
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: James DeFelice


Slave::statusUpdate attempts to set the Source of the TaskStatus to either 
SOURCE_SLAVE or SOURCE_EXECUTOR:

https://github.com/apache/mesos/blob/0.21.0/src/slave/slave.cpp#L
{code}
TaskStatus status = update.status();
status.set_source(pid == UPID() ? TaskStatus::SOURCE_SLAVE
  : TaskStatus::SOURCE_EXECUTOR);
{code}

Unfortunately it makes this change to a copy of the TaskStatus that's later 
discarded, so the change is never saved to the parent StatusUpdate. This 
results in slave-forward()ed status updates that lack a proper source value.

The likely fix is that the set_source() update should be invoked on a 
TaskStatus* that's obtained via StatusUpdate.mutable_status() so that the new 
source is saved properly. It's not clear to me if StatusUpdate should be 
obtained via mutable_update().

It would also be helpful to have a unit test for this scenario.
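The copy-vs-pointer pitfall described above can be sketched in isolation. The types below are hypothetical stand-ins that only mirror the copy/mutable-accessor semantics of generated protobuf code; they are not Mesos code.

```cpp
// Simplified stand-ins for the protobuf messages (assumption: these mirror
// only the copy-vs-pointer accessor semantics, not the real generated API).
struct TaskStatus {
  enum Source { SOURCE_NONE, SOURCE_SLAVE, SOURCE_EXECUTOR };
  Source source = SOURCE_NONE;
  void set_source(Source s) { source = s; }
};

struct StatusUpdate {
  TaskStatus status_;
  TaskStatus status() const { return status_; }      // returns a COPY
  TaskStatus* mutable_status() { return &status_; }  // returns a pointer
};

// Buggy pattern: set_source() mutates a temporary copy that is discarded.
TaskStatus::Source buggySetSource(StatusUpdate& update) {
  TaskStatus status = update.status();
  status.set_source(TaskStatus::SOURCE_EXECUTOR);
  return update.status().source;  // parent update is unchanged
}

// Likely fix: mutate through the pointer returned by mutable_status().
TaskStatus::Source fixedSetSource(StatusUpdate& update) {
  update.mutable_status()->set_source(TaskStatus::SOURCE_EXECUTOR);
  return update.status().source;
}
```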



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4812) Mesos fails to escape command health checks

2017-02-08 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-4812:
--
Labels: health-check mesosphere  (was: health-check)

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of an sh -c "", doesn't escape the double quotes already 
> present in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.
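For illustration, one way to make such embedding safe is to backslash-escape the shell metacharacters before wrapping the command. This is a minimal sketch of the idea, not the actual Mesos fix; the function name is invented.

```cpp
#include <string>

// Sketch: escape the characters that are special inside double quotes so a
// user-supplied command can be embedded safely in an `sh -c "..."` wrapper.
std::string escapeForDoubleQuotes(const std::string& command) {
  std::string out;
  for (char c : command) {
    if (c == '"' || c == '\\' || c == '$' || c == '`') {
      out += '\\';  // backslash-escape ", \, $ and ` for the shell
    }
    out += c;
  }
  return out;
}
```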





[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2017-02-08 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858038#comment-15858038
 ] 

James DeFelice commented on MESOS-4812:
---

Would like to see some traction on this. Several issues have been reported 
against Marathon, the latest of which is 
https://github.com/mesosphere/marathon/issues/5136

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of an sh -c "", doesn't escape the double quotes already 
> present in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.





[jira] [Updated] (MESOS-6951) Docker containerizer: mangled environment when env value contains LF byte

2017-02-13 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-6951:
--
Labels: mesosphere  (was: )

> Docker containerizer: mangled environment when env value contains LF byte
> -
>
> Key: MESOS-6951
> URL: https://issues.apache.org/jira/browse/MESOS-6951
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jan-Philip Gehrcke
>  Labels: mesosphere
>
> Consider this Marathon app definition:
> {code}
> {
>   "id": "/testapp",
>   "cmd": "env && tail -f /dev/null",
>   "env":{
> "TESTVAR":"line1\nline2"
>   },
>   "cpus": 0.1,
>   "mem": 10,
>   "instances": 1,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> The JSON-encoded newline in the value of the {{TESTVAR}} environment variable 
> leads to a corrupted task environment. What follows is a subset of the 
> resulting task environment (as printed via {{env}}, i.e. in key=value 
> notation):
> {code}
> line2=
> TESTVAR=line1
> {code}
> That is, the trailing part of the intended value ended up being interpreted 
> as variable name, and only the leading part of the intended value was used as 
> actual value for {{TESTVAR}}.
> Common application scenarios that would badly break with that involve 
> pretty-printed JSON documents or YAML documents passed along via the 
> environment.
> Following the code and information flow led to the conclusion that Docker's 
> {{--env-file}} command line interface is the weak point in the flow. It is 
> currently used in Mesos' Docker containerizer for passing the environment to 
> the container:
> {code}
>   argv.push_back("--env-file");
>   argv.push_back(environmentFile);
> {code}
> (Ref: 
> [code|https://github.com/apache/mesos/blob/c0aee8cc10b1d1f4b2db5ff12b771372fdd5b1f3/src/docker/docker.cpp#L584])
> Docker's {{--env-file}} argument behavior is documented via
> {quote}
> The --env-file flag takes a filename as an argument
> and expects each line to be in the VAR=VAL format,
> {quote}
> (Ref: https://docs.docker.com/engine/reference/commandline/run/)
> That is, Docker identifies individual environment variable key/value pair 
> definitions based on newline bytes in that file which explains the observed 
> environment variable value fragmentation. Notably, Docker does not provide a 
> mechanism for escaping newline bytes in the values specified in this 
> environment file.
> I think it is important to understand that Docker's {{--env-file}} mechanism 
> is ill-posed in the sense that it is not capable of transmitting the whole 
> range of environment variable values allowed by POSIX. That's what the Single 
> UNIX Specification, Version 3 has to say about environment variable values:
> {quote}
> the value shall be composed of characters from the
> portable character set (except NUL and as indicated below). 
> {quote}
> (Ref: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html)
> About "The portable character set": 
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3
> It includes (among others) the LF byte. Understandably, the current Docker 
> {{--env-file}} behavior will not change, so this is not an issue that can be 
> deferred to Docker: https://github.com/docker/docker/issues/12997
> Notably, the {{--env-file}} method for communicating environment variables to 
> Docker containers was just recently introduced to Mesos as of 
> https://issues.apache.org/jira/browse/MESOS-6566, for not leaking secrets 
> through the process listing. Previously, we specified env key/value pairs on 
> the command line which leaked secrets to the process list and probably also 
> did not support the full range of valid environment variable values.
> We need a solution that
> 1) does not leak sensitive values (i.e. is compliant with MESOS-6566).
> 2) allows for passing arbitrary environment variable values.
> It seems that Docker's {{--env}} method can be used for that. It can be used 
> to define _just the names of the environment variables_ to-be-passed-along, 
> in which case the docker binary will read the corresponding values from its 
> own environment, which we can clearly prepare appropriately when we invoke 
> the corresponding child process. This method would still leak environment 
> variable _names_ to the process listing, but (especially if documented) this 
> should be fine.
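The proposed {{--env}} passthrough could be sketched as follows. This is illustrative only (the helper name is invented; the real code path is in docker.cpp): each value is exported into the parent environment, and only variable names appear on the command line, so arbitrary bytes (except NUL) survive and no value leaks to the process listing.

```cpp
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Sketch: instead of writing VAR=VAL lines to an --env-file (which cannot
// represent LF bytes in values), export each value into our own environment
// and pass only the variable *name* via --env; docker then reads the value
// from its environment, which it inherits from us.
std::vector<std::string> dockerRunArgv(
    const std::vector<std::pair<std::string, std::string>>& env) {
  std::vector<std::string> argv = {"docker", "run"};
  for (const auto& kv : env) {
    setenv(kv.first.c_str(), kv.second.c_str(), 1);  // value set on our env
    argv.push_back("--env");
    argv.push_back(kv.first);  // name only: no value on the command line
  }
  return argv;
}
```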





[jira] [Commented] (MESOS-6951) Docker containerizer: mangled environment when env value contains LF byte

2017-02-13 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864613#comment-15864613
 ] 

James DeFelice commented on MESOS-6951:
---

xref https://reviews.apache.org/r/53877/#comment237227

It's actually a problem for more than just LFs. SPACE and TAB characters also 
generate errors for the docker env-file parser. In addition, docker uses 
special passthrough functionality for envvars in the env-file that have blank 
values.

> Docker containerizer: mangled environment when env value contains LF byte
> -
>
> Key: MESOS-6951
> URL: https://issues.apache.org/jira/browse/MESOS-6951
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jan-Philip Gehrcke
>  Labels: mesosphere
>
> Consider this Marathon app definition:
> {code}
> {
>   "id": "/testapp",
>   "cmd": "env && tail -f /dev/null",
>   "env":{
> "TESTVAR":"line1\nline2"
>   },
>   "cpus": 0.1,
>   "mem": 10,
>   "instances": 1,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> The JSON-encoded newline in the value of the {{TESTVAR}} environment variable 
> leads to a corrupted task environment. What follows is a subset of the 
> resulting task environment (as printed via {{env}}, i.e. in key=value 
> notation):
> {code}
> line2=
> TESTVAR=line1
> {code}
> That is, the trailing part of the intended value ended up being interpreted 
> as variable name, and only the leading part of the intended value was used as 
> actual value for {{TESTVAR}}.
> Common application scenarios that would badly break with that involve 
> pretty-printed JSON documents or YAML documents passed along via the 
> environment.
> Following the code and information flow led to the conclusion that Docker's 
> {{--env-file}} command line interface is the weak point in the flow. It is 
> currently used in Mesos' Docker containerizer for passing the environment to 
> the container:
> {code}
>   argv.push_back("--env-file");
>   argv.push_back(environmentFile);
> {code}
> (Ref: 
> [code|https://github.com/apache/mesos/blob/c0aee8cc10b1d1f4b2db5ff12b771372fdd5b1f3/src/docker/docker.cpp#L584])
> Docker's {{--env-file}} argument behavior is documented via
> {quote}
> The --env-file flag takes a filename as an argument
> and expects each line to be in the VAR=VAL format,
> {quote}
> (Ref: https://docs.docker.com/engine/reference/commandline/run/)
> That is, Docker identifies individual environment variable key/value pair 
> definitions based on newline bytes in that file which explains the observed 
> environment variable value fragmentation. Notably, Docker does not provide a 
> mechanism for escaping newline bytes in the values specified in this 
> environment file.
> I think it is important to understand that Docker's {{--env-file}} mechanism 
> is ill-posed in the sense that it is not capable of transmitting the whole 
> range of environment variable values allowed by POSIX. That's what the Single 
> UNIX Specification, Version 3 has to say about environment variable values:
> {quote}
> the value shall be composed of characters from the
> portable character set (except NUL and as indicated below). 
> {quote}
> (Ref: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html)
> About "The portable character set": 
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3
> It includes (among others) the LF byte. Understandably, the current Docker 
> {{--env-file}} behavior will not change, so this is not an issue that can be 
> deferred to Docker: https://github.com/docker/docker/issues/12997
> Notably, the {{--env-file}} method for communicating environment variables to 
> Docker containers was just recently introduced to Mesos as of 
> https://issues.apache.org/jira/browse/MESOS-6566, for not leaking secrets 
> through the process listing. Previously, we specified env key/value pairs on 
> the command line which leaked secrets to the process list and probably also 
> did not support the full range of valid environment variable values.
> We need a solution that
> 1) does not leak sensitive values (i.e. is compliant with MESOS-6566).
> 2) allows for passing arbitrary environment variable values.
> It seems that Docker's {{--env}} method can be used for that. It can be used 
> to define _just the names of the environment variables_ to-be-passed-along, 
> in which case the docker binary will read the corresponding values from its 
> own environment, which we can clearly prepare appropriately when we invoke 
> the corresponding child process. This method would still leak environment 
> variable _names_ to the process listing, but (especially if documented) this 
> should be fine.



--

[jira] [Comment Edited] (MESOS-6951) Docker containerizer: mangled environment when env value contains LF byte

2017-02-13 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864613#comment-15864613
 ] 

James DeFelice edited comment on MESOS-6951 at 2/13/17 10:45 PM:
-

xref https://reviews.apache.org/r/53877/#comment237227

It's actually a problem for more than just LFs. SPACE and TAB characters also 
generate errors for the docker env-file parser. In addition, docker uses 
special passthrough functionality for envvars in the env-file that have blank 
values.

see 
https://github.com/docker/docker/blob/d5fe259e121b3c86d1de0dae1760aafb48507ea9/runconfig/opts/envfile.go#L26


was (Author: jdef):
xref https://reviews.apache.org/r/53877/#comment237227

It's actually a problem for more than just LFs. SPACE and TAB characters also 
generate errors for the docker env-file parser. In addition, docker uses 
special passthrough functionality for envvars in the env-file that have blank 
values.

> Docker containerizer: mangled environment when env value contains LF byte
> -
>
> Key: MESOS-6951
> URL: https://issues.apache.org/jira/browse/MESOS-6951
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jan-Philip Gehrcke
>  Labels: mesosphere
>
> Consider this Marathon app definition:
> {code}
> {
>   "id": "/testapp",
>   "cmd": "env && tail -f /dev/null",
>   "env":{
> "TESTVAR":"line1\nline2"
>   },
>   "cpus": 0.1,
>   "mem": 10,
>   "instances": 1,
>   "container": {
> "type": "DOCKER",
> "docker": {
>   "image": "alpine"
> }
>   }
> }
> {code}
> The JSON-encoded newline in the value of the {{TESTVAR}} environment variable 
> leads to a corrupted task environment. What follows is a subset of the 
> resulting task environment (as printed via {{env}}, i.e. in key=value 
> notation):
> {code}
> line2=
> TESTVAR=line1
> {code}
> That is, the trailing part of the intended value ended up being interpreted 
> as variable name, and only the leading part of the intended value was used as 
> actual value for {{TESTVAR}}.
> Common application scenarios that would badly break with that involve 
> pretty-printed JSON documents or YAML documents passed along via the 
> environment.
> Following the code and information flow led to the conclusion that Docker's 
> {{--env-file}} command line interface is the weak point in the flow. It is 
> currently used in Mesos' Docker containerizer for passing the environment to 
> the container:
> {code}
>   argv.push_back("--env-file");
>   argv.push_back(environmentFile);
> {code}
> (Ref: 
> [code|https://github.com/apache/mesos/blob/c0aee8cc10b1d1f4b2db5ff12b771372fdd5b1f3/src/docker/docker.cpp#L584])
> Docker's {{--env-file}} argument behavior is documented via
> {quote}
> The --env-file flag takes a filename as an argument
> and expects each line to be in the VAR=VAL format,
> {quote}
> (Ref: https://docs.docker.com/engine/reference/commandline/run/)
> That is, Docker identifies individual environment variable key/value pair 
> definitions based on newline bytes in that file which explains the observed 
> environment variable value fragmentation. Notably, Docker does not provide a 
> mechanism for escaping newline bytes in the values specified in this 
> environment file.
> I think it is important to understand that Docker's {{--env-file}} mechanism 
> is ill-posed in the sense that it is not capable of transmitting the whole 
> range of environment variable values allowed by POSIX. That's what the Single 
> UNIX Specification, Version 3 has to say about environment variable values:
> {quote}
> the value shall be composed of characters from the
> portable character set (except NUL and as indicated below). 
> {quote}
> (Ref: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html)
> About "The portable character set": 
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3
> It includes (among others) the LF byte. Understandably, the current Docker 
> {{--env-file}} behavior will not change, so this is not an issue that can be 
> deferred to Docker: https://github.com/docker/docker/issues/12997
> Notably, the {{--env-file}} method for communicating environment variables to 
> Docker containers was just recently introduced to Mesos as of 
> https://issues.apache.org/jira/browse/MESOS-6566, for not leaking secrets 
> through the process listing. Previously, we specified env key/value pairs on 
> the command line which leaked secrets to the process list and probably also 
> did not support the full range of valid environment variable values.
> We need a solution that
> 1) does not leak sensitive values (i.e. is compliant with MESOS-6566).
> 2) allows for passing arbitrary environment variable values.
> It seems t

[jira] [Updated] (MESOS-1971) Switch cgroups_limit_swap default to true

2017-03-05 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-1971:
--
Labels: mesosphere  (was: )

> Switch cgroups_limit_swap default to true
> -
>
> Key: MESOS-1971
> URL: https://issues.apache.org/jira/browse/MESOS-1971
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anton Lindström
>Priority: Minor
>  Labels: mesosphere
>
> Switch cgroups_limit_swap to true per default, see MESOS-1662 for more 
> information.
> Thanks!





[jira] [Created] (MESOS-7243) CNI implementation assumptions don't align with NetworkInfo proto docs

2017-03-14 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7243:
-

 Summary: CNI implementation assumptions don't align with 
NetworkInfo proto docs
 Key: MESOS-7243
 URL: https://issues.apache.org/jira/browse/MESOS-7243
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


The protobuf docs for NetworkInfo state that frameworks may request one or more 
IP addresses via the `ip_addresses` field: the actual `ip_address` and 
`protocol` may be left blank; one entry is required for each IP address 
requested at task-launch time.

The CNI implementation doesn't check `ip_addresses` and provides one address by 
default. This behavior conflicts with the docs.

It's been suggested that it is "legal" for the CNI implementation to assume 
that if `ip_addresses` was completely empty, that would translate to 
`ip_addresses: [ {} ]` (requesting a single IP address). I've argued against 
this logic: by assuming such a default, it becomes impossible for a container 
to express interest in joining a network (namespace) but not actually be 
allocated an IP address. This might be an edge case, but it's one that's ruled 
out as soon as we assume empty-collection == give-me-one-address-please. FWIW 
the Marathon API has made this (empty = give me a default) mistake several 
times and it has burned us. Strongly urge caution here.





[jira] [Created] (MESOS-7326) ship a sample CNI config that provides a mesos-bridge network, analogous to the default bridge that ships with Docker

2017-03-30 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7326:
-

 Summary: ship a sample CNI config that provides a mesos-bridge 
network, analogous to the default bridge that ships with Docker
 Key: MESOS-7326
 URL: https://issues.apache.org/jira/browse/MESOS-7326
 Project: Mesos
  Issue Type: Task
  Components: documentation, network
Reporter: James DeFelice


UCR supports port mapping, and Marathon now ships an API that lets users specify 
a `container/bridge` networking mode. By default, this maps to a CNI network 
called `mesos-bridge`. Mesos should, at least:

# ship a sample CNI configuration file that requires minimal edits, if any, to 
enable a `mesos-bridge` CNI network for a vanilla mesos install; like docker, 
the default bridge defaults to a "host-private" mode of operation
# clearly document the steps required to enable this default bridge network for 
simple mesos clusters

/cc [~jieyu]
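For reference, a host-private setup built from the standard CNI {{bridge}} and {{host-local}} plugins might look like the sample below; the bridge name and subnet are placeholders an operator would edit per site.

```json
{
  "name": "mesos-bridge",
  "type": "bridge",
  "bridge": "mesos-cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "172.16.0.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
```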





[jira] [Created] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

2017-04-11 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7375:
-

 Summary: provide additional insight for framework developers re: 
GPU_RESOURCES capability
 Key: MESOS-7375
 URL: https://issues.apache.org/jira/browse/MESOS-7375
 Project: Mesos
  Issue Type: Documentation
Reporter: James DeFelice


On clusters where all nodes are equal and every node has a GPU, frameworks that 
**don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. This 
is surprising for operators.

Even when a framework doesn't **need** GPU resources, it may make sense for a 
framework scheduler to provide a `--enable-gpu-compat` (or similar) flag that 
results in the framework advertising the `GPU_RESOURCES` capability even though 
it does not intend to consume any GPU. The effect being that said framework 
will now receive offers on clusters where all nodes have GPU resources.
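The offer-starvation effect can be modeled with a toy filter. This is a deliberately simplified model of the master's behavior, not actual Mesos code:

```cpp
#include <cstddef>
#include <vector>

// Sketch (simplified model, not Mesos code): agents holding GPU resources
// are only offered to frameworks that advertise the GPU_RESOURCES capability.
struct Agent { bool hasGPU; };
struct Framework { bool gpuCapable; };

// Indices of agents whose resources may be offered to the framework.
std::vector<std::size_t> eligibleAgents(
    const Framework& framework, const std::vector<Agent>& agents) {
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i < agents.size(); ++i) {
    if (!agents[i].hasGPU || framework.gpuCapable) {
      out.push_back(i);
    }
  }
  return out;
}
```

On a cluster where every agent has a GPU, a framework without the capability is eligible for zero offers, which is the surprise described above.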





[jira] [Updated] (MESOS-7375) provide additional insight for framework developers re: GPU_RESOURCES capability

2017-04-11 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-7375:
--
Description: 
On clusters where all nodes are equal and every node has a GPU, frameworks that 
**don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. This 
is surprising for operators.

Even when a framework doesn't **need** GPU resources, it may make sense for a 
framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag that 
results in the framework advertising the `GPU_RESOURCES` capability even though 
it does not intend to consume any GPU. The effect being that said framework 
will now receive offers on clusters where all nodes have GPU resources.

  was:
On clusters where all nodes are equal and every node has a GPU, frameworks that 
**don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. This 
is surprising for operators.

Even when a framework doesn't **need** GPU resources, it may make sense for a 
framework scheduler to provide a `--enable-gpu-compat` (or similar) flag that 
results in the framework advertising the `GPU_RESOURCES` capability even though 
it does not intend to consume any GPU. The effect being that said framework 
will now receive offers on clusters where all nodes have GPU resources.


> provide additional insight for framework developers re: GPU_RESOURCES 
> capability
> 
>
> Key: MESOS-7375
> URL: https://issues.apache.org/jira/browse/MESOS-7375
> Project: Mesos
>  Issue Type: Documentation
>Reporter: James DeFelice
>  Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks 
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. 
> This is surprising for operators.
> Even when a framework doesn't **need** GPU resources, it may make sense for a 
> framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag 
> that results in the framework advertising the `GPU_RESOURCES` capability even 
> though it does not intend to consume any GPU. The effect being that said 
> framework will now receive offers on clusters where all nodes have GPU 
> resources.





[jira] [Created] (MESOS-7411) Frameworks may specify alternate "stop" signal vs. SIGTERM

2017-04-21 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7411:
-

 Summary: Frameworks may specify alternate "stop" signal vs. SIGTERM
 Key: MESOS-7411
 URL: https://issues.apache.org/jira/browse/MESOS-7411
 Project: Mesos
  Issue Type: Improvement
Reporter: James DeFelice


Normally Mesos sends a {{SIGTERM}} that escalates to {{SIGKILL}} when stopping 
a running process. Some apps handle {{SIGTERM}} differently and expose "stop" 
behavior via an alternate signal, for example {{SIGRTMIN+3}} (looking at you, 
systemd). It should be possible for a framework to specify an alternate "stop" 
signal via the Mesos API, which Mesos would then send to a process when 
attempting to stop it (before escalating to {{SIGKILL}}).
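A sketch of the selection logic such a feature might use (hypothetical API, nothing like this exists in Mesos today):

```cpp
#include <csignal>

// Sketch: choose the signal for each stage of a graceful stop, letting a
// framework override the default SIGTERM. `configured == 0` means the
// framework did not specify an alternate "stop" signal.
int stopSignal(int configured, bool gracePeriodExpired) {
  if (gracePeriodExpired) {
    return SIGKILL;  // escalation after the grace period is unconditional
  }
  return configured != 0 ? configured : SIGTERM;
}
```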





[jira] [Commented] (MESOS-7411) Frameworks may specify alternate "stop" signal vs. SIGTERM

2017-04-21 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979231#comment-15979231
 ] 

James DeFelice commented on MESOS-7411:
---

Related Docker feature: 
https://docs.docker.com/engine/reference/commandline/run/#stop-container-with-signal---stop-signal

> Frameworks may specify alternate "stop" signal vs. SIGTERM
> --
>
> Key: MESOS-7411
> URL: https://issues.apache.org/jira/browse/MESOS-7411
> Project: Mesos
>  Issue Type: Improvement
>Reporter: James DeFelice
>  Labels: mesosphere
>
> Normally Mesos sends a {{SIGTERM}} that escalates to {{SIGKILL}} when 
> stopping a running process. Some apps handle {{SIGTERM}} differently and 
> expose "stop" behavior via an alternate signal, for example {{SIGRTMIN+3}} 
> (looking at you, systemd). It should be possible for a framework to specify 
> an alternate "stop" signal via the Mesos API, which Mesos would then send to 
> a process when attempting to stop it (before escalating to {{SIGKILL}}).





[jira] [Comment Edited] (MESOS-7271) JNI SIGSEGV failed when connecting spark to mesos master

2017-05-18 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015769#comment-16015769
 ] 

James DeFelice edited comment on MESOS-7271 at 5/18/17 1:46 PM:


[~mgummelt] the stack trace is from an OpenJDK platform; is that what you're 
testing with?


was (Author: jdef):
[~mgummelt]the stack trace is from an OpenJDK platform; is that what you're 
testing with?

> JNI SIGSEGV failed when connecting spark to mesos master
> 
>
> Key: MESOS-7271
> URL: https://issues.apache.org/jira/browse/MESOS-7271
> Project: Mesos
>  Issue Type: Bug
>  Components: java api
>Affects Versions: 1.1.0, 1.2.0
> Environment: Ubuntu 16.04, OpenJDK 8, Spark 2.1.1
>Reporter: Qi Cui
>
> Run starting. Expected test count is: 1
> SampleDataFrameTest:
> 17/03/20 11:53:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0320 11:53:19.775842  4679 process.cpp:1071] libprocess is initialized on 
> 192.168.0.99:38293 with 8 worker threads
> I0320 11:53:19.775975  4679 logging.cpp:199] Logging to STDERR
> I0320 11:53:19.789871  4725 sched.cpp:226] Version: 1.1.0
> I0320 11:53:19.832826  4717 sched.cpp:330] New master detected at 
> master@192.168.0.50:5050
> I0320 11:53:19.838253  4717 sched.cpp:341] No credentials provided. 
> Attempting to register without authentication
> I0320 11:53:19.838337  4717 sched.cpp:820] Sending SUBSCRIBE call to 
> master@192.168.0.50:5050
> I0320 11:53:19.840265  4717 sched.cpp:853] Will retry registration in 
> 32.354951ms if necessary
> I0320 11:53:19.844734  4717 sched.cpp:743] Framework registered with 
> 6e147824-5d88-411b-9c09-a7137565c309-0001
> I0320 11:53:19.864850  4717 sched.cpp:757] Scheduler::registered took 
> 20.022604ms
> ERROR: exception pending on entry to FindMesosClass()
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ffa06fea4a6, pid=4677, tid=0x7ff9a1a46700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_121-b13) (build 
> 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13)
> # Java VM: OpenJDK 64-Bit Server VM (25.121-b13 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x6744a6]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /media/sf_G_DRIVE/src/spark-testing-base/hs_err_pid4677.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #







[jira] [Commented] (MESOS-6512) Add support for cgroups hugetlb subsystem

2017-05-18 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015841#comment-16015841
 ] 

James DeFelice commented on MESOS-6512:
---

basic isolator support for this seems to have landed: 
https://github.com/apache/mesos/commit/b495fda02566ee6e47ac5a618a6b14ff556e0d76

> Add support for cgroups hugetlb subsystem
> -
>
> Key: MESOS-6512
> URL: https://issues.apache.org/jira/browse/MESOS-6512
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>






[jira] [Commented] (MESOS-6512) Add support for cgroups hugetlb subsystem

2017-05-18 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015844#comment-16015844
 ] 

James DeFelice commented on MESOS-6512:
---

k8s is working on something similar 
https://github.com/kubernetes/kubernetes/pull/44817

> Add support for cgroups hugetlb subsystem
> -
>
> Key: MESOS-6512
> URL: https://issues.apache.org/jira/browse/MESOS-6512
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>






[jira] [Created] (MESOS-7523) Whitelist devices in bulk on a per-container basis

2017-05-18 Thread James DeFelice (JIRA)
James DeFelice created MESOS-7523:
-

 Summary: Whitelist devices in bulk on a per-container basis
 Key: MESOS-7523
 URL: https://issues.apache.org/jira/browse/MESOS-7523
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


Continuation of the work in MESOS-6791

It should be possible to whitelist a range (R) of devices such that R may be 
exposed to a container launched by an agent. Not all containers should have 
access to R by default, only those containers whose ContainerInfo specifies 
such access.

For example, it may be useful to whitelist the range of devices matching the 
glob expressions `/dev/{s,h,xv}d[a-z]*` and `/dev/dm-*` and `/dev/mapper/*` for 
a container that intends to manage storage devices.

/cc [~jieyu]
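As a sketch (not Mesos code), matching a device path against such a whitelist might look like the following. Python's `fnmatch` has no brace expansion, so `{s,h,xv}` is expanded by hand:

```python
from fnmatch import fnmatch

# Hand-expanded form of /dev/{s,h,xv}d[a-z]*, plus the other globs.
WHITELIST = [
    "/dev/sd[a-z]*", "/dev/hd[a-z]*", "/dev/xvd[a-z]*",
    "/dev/dm-*", "/dev/mapper/*",
]

def device_whitelisted(path):
    """True if the device path matches any whitelisted glob."""
    return any(fnmatch(path, pattern) for pattern in WHITELIST)

print(device_whitelisted("/dev/sda1"))        # True
print(device_whitelisted("/dev/mapper/vg0"))  # True
print(device_whitelisted("/dev/tty0"))        # False
```

Per the description, a check like this would gate which of the agent-level whitelisted devices a given container's ContainerInfo may actually request.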





[jira] [Updated] (MESOS-7523) Whitelist devices in bulk on a per-container basis

2017-05-18 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-7523:
--
Description: 
Continuation of the work in MESOS-6791

It should be possible to whitelist a range (R) of devices such that R may be 
exposed to a container launched by an agent. Not all containers should have 
access to R by default, only those containers whose ContainerInfo specifies 
such access.

For example, it may be useful to whitelist the range of devices matching the 
glob expressions `/dev/\{s,h,xv}d\[a-z]*` and `/dev/dm-\*` and `/dev/mapper/\*` 
for a container that intends to manage storage devices.

/cc [~jieyu]

  was:
Continuation of the work in MESOS-6791

It should be possible to whitelist a range (R) of devices such that R may be 
exposed to a container launched by an agent. Not all containers should have 
access to R by default, only those containers whose ContainerInfo specifies 
such access.

For example, it may be useful to whitelist the range of devices matching the 
glob expressions `/dev/{s,h,xv}d[a-z]*` and `/dev/dm-*` and `/dev/mapper/*` for 
a container that intends to manage storage devices.

/cc [~jieyu]


> Whitelist devices in bulk on a per-container basis
> --
>
> Key: MESOS-7523
> URL: https://issues.apache.org/jira/browse/MESOS-7523
> Project: Mesos
>  Issue Type: Bug
>Reporter: James DeFelice
>  Labels: mesosphere
>
> Continuation of the work in MESOS-6791
> It should be possible to whitelist a range (R) of devices such that R may be 
> exposed to a container launched by an agent. Not all containers should have 
> access to R by default, only those containers whose ContainerInfo specifies 
> such access.
> For example, it may be useful to whitelist the range of devices matching the 
> glob expressions `/dev/\{s,h,xv}d\[a-z]*` and `/dev/dm-\*` and 
> `/dev/mapper/\*` for a container that intends to manage storage devices.
> /cc [~jieyu]





[jira] [Created] (MESOS-5388) MesosContainerizerLaunch flags execute arbitrary commands via shell

2016-05-16 Thread James DeFelice (JIRA)
James DeFelice created MESOS-5388:
-

 Summary: MesosContainerizerLaunch flags execute arbitrary commands 
via shell
 Key: MESOS-5388
 URL: https://issues.apache.org/jira/browse/MESOS-5388
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


For example, the docker volume isolator's containerPath is appended (without 
sanitization) to a command that is then executed via a shell. As such, it's 
possible to inject arbitrary shell commands to be executed by Mesos.

https://github.com/apache/mesos/blob/17260204c833c643adf3d8f36ad8a1a606ece809/src/slave/containerizer/mesos/launch.cpp#L206

Perhaps, instead of strings, these commands could be sent as string arrays and 
passed as argv arguments without shell interpretation?
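The suggested argv-array approach can be illustrated in miniature (Python standing in for the C++; the attacker-controlled value is hypothetical):

```python
import subprocess

container_path = "/mnt/vol; touch /tmp/pwned"  # hostile containerPath value

# Shell form (what the issue describes): the string is interpreted by a
# shell, so the "; touch ..." suffix would execute as a second command:
#   subprocess.run(f"echo mounting {container_path}", shell=True)

# Argv form (the proposal): each element is passed verbatim via execve,
# so shell metacharacters have no effect.
result = subprocess.run(
    ["echo", "mounting", container_path],
    capture_output=True, text=True,
)
print(result.stdout)  # the "; touch ..." text is printed literally
```

With the argv form, the hostile suffix survives only as inert argument text; no `/tmp/pwned` file is created.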



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox

2016-05-16 Thread James DeFelice (JIRA)
James DeFelice created MESOS-5389:
-

 Summary: docker containerizer should prefix relative 
volume.container_path values with the path to the sandbox
 Key: MESOS-5389
 URL: https://issues.apache.org/jira/browse/MESOS-5389
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


The docker containerizer currently requires absolute paths for values of 
volume.container_path. This is inconsistent with the mesos containerizer, which 
requires a relative container_path, and it makes for a confusing API, both at 
the Mesos level and at the Marathon level.

Ideally, the docker containerizer would allow a framework to specify a relative 
path for volume.container_path and, in such cases, automatically convert it to 
an absolute path by prepending the sandbox directory.

/cc [~jieyu]
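The proposed conversion amounts to a one-liner; this sketch uses a hypothetical sandbox path:

```python
import os.path

def resolve_container_path(container_path, sandbox_dir):
    """Prepend the sandbox directory to relative container paths,
    leaving absolute paths untouched (the proposed behavior)."""
    if os.path.isabs(container_path):
        return container_path
    return os.path.join(sandbox_dir, container_path)

print(resolve_container_path("data", "/tmp/sandbox"))      # /tmp/sandbox/data
print(resolve_container_path("/etc/cfg", "/tmp/sandbox"))  # /etc/cfg
```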





[jira] [Commented] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox

2016-05-17 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286383#comment-15286383
 ] 

James DeFelice commented on MESOS-5389:
---

Disagree. The containerizers should both be able to accept relative
containerPaths; otherwise the APIs become confusing for the end user. The
user doesn't always know the path to the sandbox that Mesos (or DC/OS) will
append, so asking the user to specify an absolute containerPath (if they
want the volume mounted in their sandbox) is a non-starter.





-- 
James DeFelice
585.241.9488 (voice)
650.649.6071 (fax)


> docker containerizer should prefix relative volume.container_path values with 
> the path to the sandbox
> -
>
> Key: MESOS-5389
> URL: https://issues.apache.org/jira/browse/MESOS-5389
> Project: Mesos
>  Issue Type: Bug
>Reporter: James DeFelice
>  Labels: docker, mesosphere, storage, volumes
>
> docker containerizer currently requires absolute paths for values of 
> volume.container_path. this is inconsistent with the mesos containerizer 
> which requires relative container_path. it makes for a confusing API. both at 
> the Mesos level as well as at the Marathon level.
> ideally the docker containerizer would allow a framework to specify a 
> relative path for volume.container_path and in such cases automatically 
> convert it to an absolute path by prepending the sandbox directory to it.
> /cc [~jieyu]





[jira] [Commented] (MESOS-5389) docker containerizer should prefix relative volume.container_path values with the path to the sandbox

2016-05-17 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286386#comment-15286386
 ] 

James DeFelice commented on MESOS-5389:
---

s/will append/will use/

> docker containerizer should prefix relative volume.container_path values with 
> the path to the sandbox
> -
>
> Key: MESOS-5389
> URL: https://issues.apache.org/jira/browse/MESOS-5389
> Project: Mesos
>  Issue Type: Bug
>Reporter: James DeFelice
>  Labels: docker, mesosphere, storage, volumes
>
> docker containerizer currently requires absolute paths for values of 
> volume.container_path. this is inconsistent with the mesos containerizer 
> which requires relative container_path. it makes for a confusing API. both at 
> the Mesos level as well as at the Marathon level.
> ideally the docker containerizer would allow a framework to specify a 
> relative path for volume.container_path and in such cases automatically 
> convert it to an absolute path by prepending the sandbox directory to it.
> /cc [~jieyu]





[jira] [Created] (MESOS-5537) http v1 SUBSCRIBED scheduler event always has nil http_interval_seconds

2016-06-03 Thread James DeFelice (JIRA)
James DeFelice created MESOS-5537:
-

 Summary: http v1 SUBSCRIBED scheduler event always has nil 
http_interval_seconds
 Key: MESOS-5537
 URL: https://issues.apache.org/jira/browse/MESOS-5537
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice
 Fix For: 1.0.0


I'm writing a controller in Go to monitor heartbeats. I'd like to use the 
interval as communicated by the master, which should be specified in the 
SUBSCRIBED event. But it's not.

{code}
2016/06/03 18:34:04 {Type:SUBSCRIBED 
Subscribed:&Event_Subscribed{FrameworkID:&mesos.FrameworkID{Value:ffdb6d6e-0167-4fa2-98f9-2c3f8157fc25-0004,},HeartbeatIntervalSeconds:nil,}
 Offers:nil Rescind:nil Update:nil Message:nil Failure:nil Error:nil}
{code}

{code}
$ dpkg -l |grep -e mesos
ii  mesos   0.28.0-2.0.16.ubuntu1404 amd64  
  Cluster resource manager with efficient resource isolation
{code}

I *am* seeing HEARTBEAT events. Just not seeing the interval specified in the 
SUBSCRIBED event.
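Until the field is populated, a client-side fallback is the practical workaround. The 15-second default below is an assumption, not a value taken from this report:

```python
DEFAULT_HEARTBEAT_INTERVAL = 15.0  # seconds; assumed default

def heartbeat_interval(subscribed):
    """Prefer the master-advertised interval from the SUBSCRIBED event,
    falling back to a default when the field is nil (this bug)."""
    value = subscribed.get("heartbeat_interval_seconds")
    return DEFAULT_HEARTBEAT_INTERVAL if value is None else value

print(heartbeat_interval({"framework_id": "demo"}))            # 15.0
print(heartbeat_interval({"heartbeat_interval_seconds": 10.0}))  # 10.0
```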





[jira] [Created] (MESOS-5853) http v1 API should document behavior regarding generated content-type header in the presence of errors

2016-07-15 Thread James DeFelice (JIRA)
James DeFelice created MESOS-5853:
-

 Summary: http v1 API should document behavior regarding generated 
content-type header in the presence of errors
 Key: MESOS-5853
 URL: https://issues.apache.org/jira/browse/MESOS-5853
 Project: Mesos
  Issue Type: Bug
  Components: documentation
Reporter: James DeFelice


Changes made as part of https://issues.apache.org/jira/browse/MESOS-3739 set a 
default Content-Type header. This should be documented in the Mesos v1 HTTP API 
literature so that devs implementing against the spec know what to expect.





[jira] [Created] (MESOS-5899) docker containerizer should log a warning when docker_volume.driver_options are specified

2016-07-25 Thread James DeFelice (JIRA)
James DeFelice created MESOS-5899:
-

 Summary: docker containerizer should log a warning when 
docker_volume.driver_options are specified
 Key: MESOS-5899
 URL: https://issues.apache.org/jira/browse/MESOS-5899
 Project: Mesos
  Issue Type: Improvement
  Components: docker
Reporter: James DeFelice


Currently the docker containerizer ignores the values of 
docker_volume.driver_options, which could be confusing to framework devs trying 
to use docker volume plugins with the docker containerizer. The docker 
containerizer should probably log a warning that it's ignoring driver_options 
if they're present in the docker_volume protobuf. It might also be a good idea 
to document this limitation in the protobuf (.proto) files themselves.





[jira] [Commented] (MESOS-8237) Strip Resource.allocation_info for non-MULTI_ROLE schedulers.

2017-11-30 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273358#comment-16273358
 ] 

James DeFelice commented on MESOS-8237:
---

[~bmahler] I don't foresee an issue w/ Offer.allocation_info related to 
mesos-go. I can't speak for other Mesos library maintainers.

> Strip Resource.allocation_info for non-MULTI_ROLE schedulers.
> -
>
> Key: MESOS-8237
> URL: https://issues.apache.org/jira/browse/MESOS-8237
> Project: Mesos
>  Issue Type: Bug
>Reporter: James DeFelice
>Assignee: Benjamin Mahler
>  Labels: mesosphere
>
> In support of MULTI_ROLE capable frameworks, a Resource.allocation_info field 
> was added and the Resource math of the Mesos library was updated to check for 
> matching allocation_info when checking for (in)equality, addability, 
> subtractability, containment, etc. To compensate for these changes, the demo 
> frameworks of Mesos were updated to set the allocation_info for Resource 
> objects during the "matching phase" in which offers' resources are evaluated 
> in order for the framework to launch tasks. The Mesos demo frameworks NEEDED 
> to be updated because the Resource algebra within Mesos now depended on 
> matching allocation_info fields of Resource objects when executing algebraic 
> operations. See 
> https://github.com/apache/mesos/commit/c20744a9976b5e83698e9c6062218abb4d2e6b25#diff-298cc6a77862b7ff3422cd06c215ef28R91
>  .
> This poses a unique problem for **external** libraries that both aim to 
> support various frameworks, some that DO and some that DO NOT opt-in to the 
> MULTI_ROLE capability; specifically those external libraries that implement 
> Resource algebra that's consistent with what Mesos implements internally. One 
> such example of a library is mesos-go, though there are undoubtedly others. 
> The problem can be explained via this scenario: 
> {quote}
> Flo's mesos-go framework is running well, it doesn't opt-in to MULTI_ROLE 
> because it doesn't need multiple roles. His framework runs on a version of 
> Mesos that existed prior to integration of MULTI_ROLE support. His DC 
> operator upgrades the mesos cluster to the latest version. Flo rebuilds his 
> framework on the latest version of mesos-go and re-launches it on the 
> cluster. He observes that his framework receives offers, but rejects ALL of 
> them. Digging into the code he realizes that Mesos is injecting 
> allocation_info into Resource objects being offered to his framework, and 
> mesos-go considers allocation_info when comparing Resource objects (because 
> it's MULTI_ROLE compatible now), but his framework doesn't take this into 
> consideration when preparing its own Resource objects prior to the "resource 
> matching phase". The consequence is that Flo's framework is trying to match 
> against Resources that will never align because his framework isn't setting 
> an allocation_info that might possibly match the allocation_info that Mesos 
> is always injecting - regardless of the MULTI_ROLE capability (or lack 
> thereof in this case) of his framework.
> {quote}
> If Mesos were to strip the allocation_info from Resource objects, prior to 
> offering them to non-multi-role frameworks, then the problem illustrated 
> above would go away.
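The stripping proposed above is straightforward; this sketch models the Offer and Resource protobufs as plain dicts, which is an illustration rather than the Mesos API:

```python
def strip_allocation_info(offer, framework_capabilities):
    """Remove allocation_info from offered resources unless the
    framework opted in to MULTI_ROLE (the proposed master behavior)."""
    if "MULTI_ROLE" in framework_capabilities:
        return offer
    for resource in offer.get("resources", []):
        resource.pop("allocation_info", None)
    return offer

offer = {"resources": [
    {"name": "cpus", "scalar": 4.0, "allocation_info": {"role": "dev"}},
]}
strip_allocation_info(offer, frozenset())  # non-MULTI_ROLE framework
print(offer["resources"][0])  # {'name': 'cpus', 'scalar': 4.0}
```

A framework like Flo's would then see offers whose resources compare equal to its own unannotated Resource objects, and the matching phase would succeed again.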



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8244) Add operator API to reload local resource providers.

2017-12-10 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16285458#comment-16285458
 ] 

James DeFelice commented on MESOS-8244:
---

It looks to have landed here: https://reviews.apache.org/r/63901/ but there's a 
bug in the API (I left a comment on the review)

> Add operator API to reload local resource providers.
> 
>
> Key: MESOS-8244
> URL: https://issues.apache.org/jira/browse/MESOS-8244
> Project: Mesos
>  Issue Type: Task
>Reporter: Chun-Hung Hsiao
>Assignee: Chun-Hung Hsiao
>  Labels: mesosphere, storage
>
> To add, remove and update local resource providers on the fly more 
> conveniently and without restarting agents, we would like to introduce new 
> operator API to add new config files in the resource provider config 
> directory and trigger a reload for the resource provider.





[jira] [Updated] (MESOS-8342) Make the Docker containerizer exhibit the same behavior as Mesos/UCR which sets `memory.memsw.limit_in_bytes` to be equal to `memory.limit_in_bytes` when `MESOS_CGROUPS_L

2017-12-18 Thread James DeFelice (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James DeFelice updated MESOS-8342:
--
Labels: mesosphere  (was: )

> Make the Docker containerizer exhibit the same behavior as Mesos/UCR which 
> sets `memory.memsw.limit_in_bytes` to be equal to `memory.limit_in_bytes` 
> when `MESOS_CGROUPS_LIMIT_SWAP=true`
> -
>
> Key: MESOS-8342
> URL: https://issues.apache.org/jira/browse/MESOS-8342
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Affects Versions: 1.4.1
>Reporter: Vishnu Mohan
>  Labels: mesosphere
>
> Please add support for the functionality afforded by the {{\--memory-swap}} 
> and {{\--memory-swappiness}} {{docker run}} options to the Docker 
> Containerizer: 
> https://github.com/apache/mesos/blob/1.4.x/src/docker/docker.hpp#L193-L194
> ATM the Docker containerizer does not honor 
> {{MESOS_CGROUPS_LIMIT_SWAP=true}}, and depending on the OS {{swappiness}} 
> configuration, the Docker Engine will (typically) set 
> {{memory.memsw.limit_in_bytes}} to twice the value of 
> {{memory.limit_in_bytes}}
> This means that all Docker containers can/will swap up to 2x their allocation 
> before being OOM-killed by the Docker Engine which can cause a huge 
> performance problem.
> The only real workaround, for now, is to ensure that all apps that are 
> launched by the Docker containerizer (at least those that are launched via 
> Marathon) also pass {{\--memory-swap=}} and/or pass 
> {{\--memory-swappiness=0}}, depending on the version of the Docker Engine, 
> ({{docker run --help}}), as arbitrary Docker params, assuming the scheduler 
> supports it, which is operationally cumbersome.
> Ideally, the Docker containerizer would exhibit the same behavior as 
> Mesos/UCR which sets {{memory.memsw.limit_in_bytes}} to be equal to 
> {{memory.limit_in_bytes}} when {{MESOS_CGROUPS_LIMIT_SWAP=true}}
> Ref: 
> https://docs.docker.com/engine/admin/resource_constraints/#prevent-a-container-from-using-swap
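The UCR behavior being requested reduces to mirroring one cgroup value into another. The sketch below writes to a throwaway directory standing in for the container's memory cgroup, since touching a real cgroup requires privileges:

```python
import os
import tempfile

def limit_swap(cgroup_dir):
    """Mirror memory.limit_in_bytes into memory.memsw.limit_in_bytes,
    as UCR does when MESOS_CGROUPS_LIMIT_SWAP=true."""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes")) as f:
        limit = f.read().strip()
    with open(os.path.join(cgroup_dir, "memory.memsw.limit_in_bytes"), "w") as f:
        f.write(limit)

cgroup = tempfile.mkdtemp()  # stand-in for /sys/fs/cgroup/memory/<container>
with open(os.path.join(cgroup, "memory.limit_in_bytes"), "w") as f:
    f.write("1073741824")  # 1 GiB
limit_swap(cgroup)
with open(os.path.join(cgroup, "memory.memsw.limit_in_bytes")) as f:
    memsw = f.read()
print(memsw)  # 1073741824
```

With mem+swap capped at the memory limit, a container hits the OOM killer at its allocation instead of silently swapping up to 2x.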





[jira] [Created] (MESOS-8693) agent: update_resource_provider w/ identical RP info should not always force-restart plugin

2018-03-19 Thread James DeFelice (JIRA)
James DeFelice created MESOS-8693:
-

 Summary: agent: update_resource_provider w/ identical RP info 
should not always force-restart plugin
 Key: MESOS-8693
 URL: https://issues.apache.org/jira/browse/MESOS-8693
 Project: Mesos
  Issue Type: Task
Affects Versions: 1.5.0
Reporter: James DeFelice


Currently when the UPDATE_RESOURCE_PROVIDER call is sent to an agent, and the 
RP info of the request is identical to that of the running configuration, the 
agent force-restarts the related CSI plugin. This is surprising on two counts:

First, because it increases the complexity of the client that wants to ensure 
the latest RP configuration is pushed to the agent. A CSI plugin may take a 
long time to become ready after being reconfigured. It's likely that a caller 
will experience a timeout while waiting for the RP to come into a healthy state 
w/ the desired configuration. Upon retrying the update, a client DOES NOT 
always wish to restart an ongoing reconfiguration effort, especially for 
long-running reconfiguration operations. Mesos should NOT restart the related 
CSI plugin by default if the new RP info matches the existing one, and instead 
should either return 409 or some other, more appropriate error code (409 would 
be nice/consistent, see below).

Second, because it differs from the idempotent nature of the 
ADD_RESOURCE_PROVIDER call, which does NOT change the state of the plugin in 
case of a duplicate request. The ADD_RESOURCE_PROVIDER call returns a 409 
response, which allows callers to simply re-issue redundant requests without 
concern for interrupting the state of a running plugin.

In the event that the caller DOES want to force the restart of an underlying CSI 
plugin, I suggest that we extend the UPDATE_RESOURCE_PROVIDER call w/ a 
"force_restart" field (sibling to the "info" field). "force_restart == true" 
would only have meaning for updates that involve unchanged RP info, otherwise 
it would go unused.

/cc [~jieyu] [~chhsia0]
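The proposed semantics, including the suggested (hypothetical) force_restart field, can be summarized in a few lines:

```python
def update_resource_provider(current_info, new_info, force_restart=False):
    """Return (http_status, restart_plugin) under the proposed semantics:
    an unchanged config yields 409 with no plugin restart, unless the
    hypothetical force_restart field is set."""
    if new_info == current_info and not force_restart:
        return 409, False   # idempotent, consistent with ADD_RESOURCE_PROVIDER
    return 200, True        # apply the config and restart the CSI plugin

print(update_resource_provider({"type": "csi"}, {"type": "csi"}))  # (409, False)
print(update_resource_provider({"type": "csi"}, {"type": "csi"},
                               force_restart=True))                # (200, True)
```

Under this contract a client can safely re-issue a redundant UPDATE_RESOURCE_PROVIDER after a timeout without interrupting a plugin that is still reconfiguring.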





[jira] [Commented] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions

2018-03-27 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415941#comment-16415941
 ] 

James DeFelice commented on MESOS-7697:
---

https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674

> Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
> 
>
> Key: MESOS-7697
> URL: https://issues.apache.org/jira/browse/MESOS-7697
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> Returning a 404 error for a condition that's a known temporary condition is 
> confusing from a client's perspective. A client wants to know how to recover 
> from various error conditions. A 404 error condition should be distinct from 
> a "server is not yet ready, but will be shortly" condition (which should 
> probably be reported as a 503 "unavailable" error).
> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593
> {code}
> if (response->code == process::http::Status::NOT_FOUND) {
>   // This could happen if the master libprocess process has not yet set up
>   // HTTP routes.
>   LOG(WARNING) << "Received '" << response->status << "' ("
><< response->body << ") for " << call.type();
>   return;
> }
> {code}
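From a client's perspective, the desired contract is a simple status-code classification; treating 404 as retryable is the workaround forced by the current behavior:

```python
def is_temporary(status_code):
    """Classify scheduler-API responses for retry purposes. 503 is the
    correct signal for "not ready yet"; 404 must also be treated as
    temporary today, because the master returns it before its HTTP
    routes are installed (the behavior this issue describes)."""
    return status_code in (404, 503)

print(is_temporary(404))  # True: retry, the master may still be starting
print(is_temporary(400))  # False: a real client error, do not retry
```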





[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions

2018-03-27 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415941#comment-16415941
 ] 

James DeFelice edited comment on MESOS-7697 at 3/27/18 5:13 PM:


[https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674]

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740


was (Author: jdef):
https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674

> Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
> 
>
> Key: MESOS-7697
> URL: https://issues.apache.org/jira/browse/MESOS-7697
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> Returning a 404 error for a condition that's a known temporary condition is 
> confusing from a client's perspective. A client wants to know how to recover 
> from various error conditions. A 404 error condition should be distinct from 
> a "server is not yet ready, but will be shortly" condition (which should 
> probably be reported as a 503 "unavailable" error).
> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593
> {code}
> if (response->code == process::http::Status::NOT_FOUND) {
>   // This could happen if the master libprocess process has not yet set up
>   // HTTP routes.
>   LOG(WARNING) << "Received '" << response->status << "' ("
><< response->body << ") for " << call.type();
>   return;
> }
> {code}





[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions

2018-03-27 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415941#comment-16415941
 ] 

James DeFelice edited comment on MESOS-7697 at 3/27/18 5:14 PM:


[https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674]

https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L2729-L2740


was (Author: jdef):
[https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674]

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740

> Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
> 
>
> Key: MESOS-7697
> URL: https://issues.apache.org/jira/browse/MESOS-7697
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> Returning a 404 error for a condition that's a known temporary condition is 
> confusing from a client's perspective. A client wants to know how to recover 
> from various error conditions. A 404 error condition should be distinct from 
> a "server is not yet ready, but will be shortly" condition (which should 
> probably be reported as a 503 "unavailable" error).
> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593
> {code}
> if (response->code == process::http::Status::NOT_FOUND) {
>   // This could happen if the master libprocess process has not yet set up
>   // HTTP routes.
>   LOG(WARNING) << "Received '" << response->status << "' ("
><< response->body << ") for " << call.type();
>   return;
> }
> {code}





[jira] [Commented] (MESOS-4065) slave FD for ZK tcp connection leaked to executor process

2018-04-23 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448611#comment-16448611
 ] 

James DeFelice commented on MESOS-4065:
---

It looks like the linked ZK ticket was recently resolved.

> slave FD for ZK tcp connection leaked to executor process
> -
>
> Key: MESOS-4065
> URL: https://issues.apache.org/jira/browse/MESOS-4065
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.24.1, 0.25.0, 1.2.2
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere, security
>
> {code}
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e etcd
> root  1432 99.3  0.0 202420 12928 ?Rsl  21:32  13:51 
> ./etcd-mesos-executor -log_dir=./
> root  1450  0.4  0.1  38332 28752 ?Sl   21:32   0:03 ./etcd 
> --data-dir=etcd_data --name=etcd-1449178273 
> --listen-peer-urls=http://10.0.0.45:1025 
> --initial-advertise-peer-urls=http://10.0.0.45:1025 
> --listen-client-urls=http://10.0.0.45:1026 
> --advertise-client-urls=http://10.0.0.45:1026 
> --initial-cluster=etcd-1449178273=http://10.0.0.45:1025,etcd-1449178271=http://10.0.2.95:1025,etcd-1449178272=http://10.0.2.216:1025
>  --initial-cluster-state=existing
> core  1651  0.0  0.0   6740   928 pts/0S+   21:46   0:00 grep 
> --colour=auto -e etcd
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1432|grep -e 2181
> etcd-meso 1432 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> core@ip-10-0-0-45 ~ $ ps auxwww|grep -e slave
> root  1124  0.2  0.1 900496 25736 ?Ssl  21:11   0:04 
> /opt/mesosphere/packages/mesos--52cbecde74638029c3ba0ac5e5ab81df8debf0fa/sbin/mesos-slave
> core  1658  0.0  0.0   6740   832 pts/0S+   21:46   0:00 grep 
> --colour=auto -e slave
> core@ip-10-0-0-45 ~ $ sudo lsof -p 1124|grep -e 2181
> mesos-sla 1124 root   10u IPv4  21973  0t0TCP 
> ip-10-0-0-45.us-west-2.compute.internal:54016->ip-10-0-5-206.us-west-2.compute.internal:2181
>  (ESTABLISHED)
> {code}
> I only tested against mesos 0.24.1 and 0.25.0.
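> The usual remedy for this class of leak (an assumption about the fix, not
> the actual ZooKeeper client patch) is to mark the descriptor close-on-exec
> so it is not inherited by exec'd children such as executor processes:
> {code}
> #include <fcntl.h>
> #include <unistd.h>
>
> // Returns true if FD_CLOEXEC was successfully set on fd.
> bool setCloexec(int fd) {
>   int flags = fcntl(fd, F_GETFD);
>   if (flags == -1) {
>     return false;
>   }
>   return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
> }
> {code}

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>

// Sketch of the usual remedy for this class of leak (an assumption about
// the fix, not the actual ZooKeeper client patch): mark the descriptor
// close-on-exec so it is not inherited by exec'd children such as executors.
bool setCloexec(int fd) {
  int flags = fcntl(fd, F_GETFD);
  if (flags == -1) {
    return false;
  }
  return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) != -1;
}
```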



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8967) Comments for FaultDomain should include notes for convention pertaining to additional hierarchy.

2018-05-30 Thread James DeFelice (JIRA)
James DeFelice created MESOS-8967:
-

 Summary: Comments for FaultDomain should include notes for 
convention pertaining to additional hierarchy.
 Key: MESOS-8967
 URL: https://issues.apache.org/jira/browse/MESOS-8967
 Project: Mesos
  Issue Type: Task
Reporter: James DeFelice


The original design doc includes conventions for additional hierarchy. This 
commentary is missing from the protobuf and so it's easily missed.

https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8/edit#heading=h.emfys1xszpir



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9238) rpmbuild checkfiles fails

2018-09-15 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9238:
-

 Summary: rpmbuild checkfiles fails
 Key: MESOS-9238
 URL: https://issues.apache.org/jira/browse/MESOS-9238
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice


I noticed that Mesos nightly builds haven't been pushed to dockerhub in a 
while. After some help from Jie and digging a bit more, it looks like 
rpm-build is reporting an error:
{code:java}
RPM build errors:
error: Installed (but unpackaged) file(s) found:
   /usr/include/rapidjson/allocators.h
   /usr/include/rapidjson/document.h
   /usr/include/rapidjson/encodedstream.h
   /usr/include/rapidjson/encodings.h
   /usr/include/rapidjson/error/en.h
   /usr/include/rapidjson/error/error.h
   /usr/include/rapidjson/filereadstream.h
   /usr/include/rapidjson/filewritestream.h
   /usr/include/rapidjson/fwd.h
   /usr/include/rapidjson/internal/biginteger.h
   /usr/include/rapidjson/internal/diyfp.h
   /usr/include/rapidjson/internal/dtoa.h
   /usr/include/rapidjson/internal/ieee754.h
   /usr/include/rapidjson/internal/itoa.h
   /usr/include/rapidjson/internal/meta.h
   /usr/include/rapidjson/internal/pow10.h
   /usr/include/rapidjson/internal/regex.h
   /usr/include/rapidjson/internal/stack.h
   /usr/include/rapidjson/internal/strfunc.h
   /usr/include/rapidjson/internal/strtod.h
   /usr/include/rapidjson/internal/swap.h
   /usr/include/rapidjson/istreamwrapper.h
   /usr/include/rapidjson/memorybuffer.h
   /usr/include/rapidjson/memorystream.h
   /usr/include/rapidjson/msinttypes/inttypes.h
   /usr/include/rapidjson/msinttypes/stdint.h
   /usr/include/rapidjson/ostreamwrapper.h
   /usr/include/rapidjson/pointer.h
   /usr/include/rapidjson/prettywriter.h
   /usr/include/rapidjson/rapidjson.h
   /usr/include/rapidjson/reader.h
   /usr/include/rapidjson/schema.h
   /usr/include/rapidjson/stream.h
   /usr/include/rapidjson/stringbuffer.h
   /usr/include/rapidjson/writer.h
Macro %MESOS_VERSION has empty body
Macro %MESOS_RELEASE has empty body
{code}
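One common way to silence this class of rpmbuild error is to either package
or drop the stray headers. A hedged sketch of the two usual options in a spec
file (illustrative only; the actual mesos.spec layout may differ):
{code}
# Option A: remove the bundled rapidjson headers during %install so the
# installed-but-unpackaged check never sees them:
rm -rf %{buildroot}%{_includedir}/rapidjson

# Option B: explicitly exclude them from the package in %files:
%exclude %{_includedir}/rapidjson/
{code}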
Furthermore, the cleanup function that's invoked by the trap is failing with a 
bunch of permission errors:
{code:java}
cleanup
rm: cannot remove 
'/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/.cache': 
Permission denied
rm: cannot remove 
'/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/SRPMS':
 Permission denied
rm: cannot remove 
'/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/lib/mesos':
 Permission denied
rm: cannot remove 
'/home/jenkins/jenkins-slave/workspace/Mesos-Docker-CentOS/centos7/rpmbuild/BUILDROOT/mesos-1.8.0-0.1.pre.20180915git4805a47.el7.x86_64/var/log/mesos':
 Permission denied
...
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9293) OperationStatus messages sent to framework should include both agent ID and resource provider ID

2018-10-04 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9293:
-

 Summary: OperationStatus messages sent to framework should include 
both agent ID and resource provider ID
 Key: MESOS-9293
 URL: https://issues.apache.org/jira/browse/MESOS-9293
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.7.0
Reporter: James DeFelice


Normally, frameworks are expected to checkpoint agent ID and resource provider 
ID before accepting an offer with an OfferOperation. From this expectation 
comes the requirement in the v1 scheduler API that a framework must provide the 
agent ID and resource provider ID when acknowledging an offer operation status 
update. However, this expectation breaks down:

1. the framework might lose its checkpointed data; it no longer remembers the 
agent ID or the resource provider ID

2. even if the framework checkpoints data, it could be sent a stale update: 
maybe the original ACK it sent to Mesos was lost, and it needs to re-ACK. If a 
framework deleted its checkpointed data after sending the ACK (which was then 
dropped), then upon replay of the status update it no longer has the agent ID 
or resource provider ID for the operation.

An easy remedy would be to add the agent ID and resource provider ID to the 
OperationStatus message received by the scheduler so that a framework can build 
a proper ACK for the update, even if it doesn't have access to its previously 
checkpointed information.
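
A hedged sketch of the proposed change (field names and tag numbers are purely 
illustrative; the real v1 mesos.proto may differ):
{code}
message OperationStatus {
  // ... existing fields (operation_id, state, uuid, etc.) ...

  // Proposed: echo enough routing information that a framework can build
  // an acknowledgement without consulting checkpointed state.
  optional AgentID agent_id = 6;                         // illustrative tag
  optional ResourceProviderID resource_provider_id = 7;  // illustrative tag
}
{code}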

I'm filing this as a BUG because there's no way to reliably use the offer 
operation status API until this has been fixed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-10-05 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640285#comment-16640285
 ] 

James DeFelice commented on MESOS-9223:
---

Regardless of whether retries are implemented, it would be nice to have an API 
that exposes the reason for the error, e.g. the last log line or the Mesos 
error related to the container failure.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough. If, 
> e.g., an RP container fails to come up during node start, a warning is 
> logged, but an operator still needs to detect the degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not always have enough 
> context for this decision. It would be better if the provider either 
> enforced a restart by failing over the whole agent, or retried the 
> operation (optionally: up to some maximum number of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9308) URI disk profile adaptor could deadlock.

2018-10-11 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646563#comment-16646563
 ] 

James DeFelice commented on MESOS-9308:
---

I'm wondering why this wasn't caught by a unit test. Maybe we need more unit 
tests around helpers like this. Tech debt?

> URI disk profile adaptor could deadlock.
> 
>
> Key: MESOS-9308
> URL: https://issues.apache.org/jira/browse/MESOS-9308
> Project: Mesos
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 1.5.1, 1.6.1, 1.7.0
>Reporter: Jie Yu
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: mesosphere, storage
> Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> The loop here can be infinite:
> https://github.com/apache/mesos/blob/1.7.0/src/resource_provider/storage/uri_disk_profile_adaptor.cpp#L61-L80



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9313) Document speculative offer operation semantics for framework writers.

2018-10-12 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9313:
-

 Summary: Document speculative offer operation semantics for 
framework writers.
 Key: MESOS-9313
 URL: https://issues.apache.org/jira/browse/MESOS-9313
 Project: Mesos
  Issue Type: Documentation
Reporter: James DeFelice


It recently came to my attention that a subset of offer operations (e.g. 
RESERVE, UNRESERVE, et al.) is implemented speculatively within the Mesos 
master: the master applies the resource conversion internally **before** the 
conversion is checkpointed on the agent. The master may then re-offer the 
converted resource to a framework, even though the agent may still not have 
checkpointed the resource conversion. If the checkpointing process on the 
agent fails, subsequent operations issued for the falsely-offered resource 
will fail, because the master essentially "lied" to the framework about the 
true state of the supposedly-converted resource.

It's also been explained to me that this case is expected to be rare. However, 
it *can* impact the design/implementation of framework state machines and so 
it's critical that this information be documented clearly - outside of the C++ 
code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9318) Consider providing better operation status updates while an RP is recovering

2018-10-15 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650756#comment-16650756
 ] 

James DeFelice commented on MESOS-9318:
---

Yes to this. It's pretty annoying to deal with this kind of UNKNOWN otherwise.

> Consider providing better operation status updates while an RP is recovering
> 
>
> Key: MESOS-9318
> URL: https://issues.apache.org/jira/browse/MESOS-9318
> Project: Mesos
>  Issue Type: Task
>Affects Versions: 1.6.0, 1.7.0
>Reporter: Gastón Kleiman
>Priority: Major
>  Labels: mesosphere, operation-feedback
>
> Consider the following scenario:
> 1. A framework accepts an offer with an operation affecting SLRP resources.
> 2. The master forwards it to the corresponding agent.
> 3. The agent forwards it to the corresponding RP.
> 4. The agent and the master fail over.
> 5. The master recovers.
> 6. The agent recovers while the RP is still recovering, so it doesn't include 
> the pending operation on the {{RegisterMessage}}.
> 7. A framework performs an explicit operation status reconciliation.
> In this case the master will currently respond with {{OPERATION_UNKNOWN}}, 
> but it should be possible to respond with a more fine-grained and useful 
> state, such as {{OPERATION_RECOVERING}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9352) Data in persistent volume deleted accidentally when using Docker container and Persistent volume

2018-10-24 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662650#comment-16662650
 ] 

James DeFelice commented on MESOS-9352:
---

Seemingly related MESOS-9049, MESOS-8830, MESOS-2408

> Data in persistent volume deleted accidentally when using Docker container 
> and Persistent volume
> 
>
> Key: MESOS-9352
> URL: https://issues.apache.org/jira/browse/MESOS-9352
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker
>Affects Versions: 1.5.1, 1.5.2
> Environment: DCOS 1.11.6
> Mesos 1.5.2
>Reporter: David Ko
>Priority: Critical
>  Labels: mesosphere, persistent-volumes
> Attachments: image-2018-10-24-22-20-51-059.png, 
> image-2018-10-24-22-21-13-399.png
>
>
> Using a docker image with a persistent volume to start a service can cause 
> the data in the persistent volume to be deleted accidentally when the task 
> is killed and restarted; old mount points are also left mounted, even after 
> the service is deleted. 
> *The expected result: data in the persistent volume should be kept until the 
> task is deleted completely, and dangling mount points should be unmounted 
> correctly.*
>  
> *Step 1:* Use below JSON config to create a Mysql server using Docker image 
> and Persistent Volume
> {code:javascript}
> {
>   "env": {
> "MYSQL_USER": "wordpress",
> "MYSQL_PASSWORD": "secret",
> "MYSQL_ROOT_PASSWORD": "supersecret",
> "MYSQL_DATABASE": "wordpress"
>   },
>   "id": "/mysqlgc",
>   "backoffFactor": 1.15,
>   "backoffSeconds": 1,
>   "constraints": [
> [
>   "hostname",
>   "IS",
>   "172.27.12.216"
> ]
>   ],
>   "container": {
> "portMappings": [
>   {
> "containerPort": 3306,
> "hostPort": 0,
> "protocol": "tcp",
> "servicePort": 1
>   }
> ],
> "type": "DOCKER",
> "volumes": [
>   {
> "persistent": {
>   "type": "root",
>   "size": 1000,
>   "constraints": []
> },
> "mode": "RW",
> "containerPath": "mysqldata"
>   },
>   {
> "containerPath": "/var/lib/mysql",
> "hostPath": "mysqldata",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "mysql",
>   "forcePullImage": false,
>   "privileged": false,
>   "parameters": []
> }
>   },
>   "cpus": 1,
>   "disk": 0,
>   "instances": 1,
>   "maxLaunchDelaySeconds": 3600,
>   "mem": 512,
>   "gpus": 0,
>   "networks": [
> {
>   "mode": "container/bridge"
> }
>   ],
>   "residency": {
> "relaunchEscalationTimeoutSeconds": 3600,
> "taskLostBehavior": "WAIT_FOREVER"
>   },
>   "requirePorts": false,
>   "upgradeStrategy": {
> "maximumOverCapacity": 0,
> "minimumHealthCapacity": 0
>   },
>   "killSelection": "YOUNGEST_FIRST",
>   "unreachableStrategy": "disabled",
>   "healthChecks": [],
>   "fetch": []
> }
> {code}
> *Step 2:* Kill the mysqld process to force rescheduling of a new Mysql task. 
> Two mount points to the same persistent volume were found, meaning the old 
> mount point was not unmounted immediately.
> !image-2018-10-24-22-20-51-059.png!
> *Step 3:* After GC, the data in the persistent volume was deleted 
> accidentally, while mysqld (the Mesos task) was still running
> !image-2018-10-24-22-21-13-399.png!
> *Step 4:* Delete the Mysql service from Marathon. None of the mount points 
> could be unmounted, even though the service was already deleted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-12-03 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707318#comment-16707318
 ] 

James DeFelice commented on MESOS-9223:
---

MESOS-8380 addresses UI changes. The UI should not be the only place to easily 
observe/troubleshoot errors; ideally there'd be an API that exposes them.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough. If, 
> e.g., an RP container fails to come up during node start, a warning is 
> logged, but an operator still needs to detect the degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not always have enough 
> context for this decision. It would be better if the provider either 
> enforced a restart by failing over the whole agent, or retried the 
> operation (optionally: up to some maximum number of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9517) SLRP should treat gRPC timeouts as non-terminal errors, instead of reporting OPERATION_FAILED.

2019-01-09 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9517:
-

 Summary: SLRP should treat gRPC timeouts as non-terminal errors, 
instead of reporting OPERATION_FAILED.
 Key: MESOS-9517
 URL: https://issues.apache.org/jira/browse/MESOS-9517
 Project: Mesos
  Issue Type: Bug
  Components: resource provider, storage
Reporter: James DeFelice
Assignee: Chun-Hung Hsiao


1. framework executes a CREATE_DISK operation.
2. The SLRP issues a CreateVolume RPC to the plugin
3. The RPC call times out
4. The agent/SLRP translates non-terminal gRPC timeout errors 
(DeadlineExceeded) for "CreateVolume" calls into OPERATION_FAILED, which is 
terminal.
5. framework receives a *terminal* OPERATION_FAILED status, so it executes 
another CREATE_DISK operation.
6. The second CREATE_DISK operation does not time out.
7. The first CREATE_DISK operation was actually completed by the plugin, 
unbeknownst to the SLRP.
8. There's now an orphan volume in the storage system that no one is tracking.

Proposed solution: the SLRP makes more intelligent decisions about non-terminal 
gRPC errors. For example, timeouts are likely expected for potentially 
long-running storage operations and should not be considered terminal. In such 
cases, the SLRP should NOT report OPERATION_FAILED and instead should re-issue 
the **same** (idempotent) CreateVolume call to the plugin to ascertain the 
status of the requested volume creation.
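
A minimal sketch of the proposed retry behavior (names are illustrative, not 
the real SLRP/CSI API): a DeadlineExceeded gRPC status is treated as 
non-terminal, and the *same* idempotent CreateVolume call is re-issued until 
the plugin returns a terminal answer.
{code}
enum class Grpc { Ok, DeadlineExceeded, InvalidArgument };

struct Plugin {
  int calls = 0;
  Grpc createVolume(const std::string& name) {
    (void)name;  // same name on every retry makes the call idempotent
    // Simulate: the first call times out even though the volume is created.
    return ++calls == 1 ? Grpc::DeadlineExceeded : Grpc::Ok;
  }
};

// Returns true if the operation eventually succeeded.
bool createDisk(Plugin& plugin, const std::string& name, int maxRetries) {
  for (int i = 0; i <= maxRetries; ++i) {
    Grpc s = plugin.createVolume(name);  // re-issue the *same* call
    if (s == Grpc::Ok) return true;
    if (s != Grpc::DeadlineExceeded) return false;  // terminal failure
  }
  return false;  // still unknown; do NOT report OPERATION_FAILED here
}
{code}

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the proposed SLRP behavior (names are illustrative,
// not the real SLRP/CSI API): a DeadlineExceeded gRPC status is treated as
// non-terminal, and the *same* idempotent CreateVolume call is re-issued
// until the plugin returns a terminal answer.
enum class Grpc { Ok, DeadlineExceeded, InvalidArgument };

struct Plugin {
  int calls = 0;
  Grpc createVolume(const std::string& name) {
    (void)name;  // same name on every retry makes the call idempotent
    // Simulate: the first call times out even though the volume is created.
    return ++calls == 1 ? Grpc::DeadlineExceeded : Grpc::Ok;
  }
};

// Returns true if the operation eventually succeeded.
bool createDisk(Plugin& plugin, const std::string& name, int maxRetries) {
  for (int i = 0; i <= maxRetries; ++i) {
    Grpc s = plugin.createVolume(name);  // re-issue the *same* call
    if (s == Grpc::Ok) return true;
    if (s != Grpc::DeadlineExceeded) return false;  // terminal failure
  }
  return false;  // still unknown; do NOT report OPERATION_FAILED here
}
```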

Agent logs for the 3 orphan vols above:
{code}
[jdefelice@ec101 DCOS-46889]$ grep -e 3bd1a1a9-43d3-485c-9275-59cebd64b07c 
agent.log
Jan 09 11:10:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:10:27.896306 13189 provider.cpp:1548] Received CREATE_DISK operation 
'a1BdfrEhy4ZLSNPZbDrzp1h-0' (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c)
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
E0109 11:11:27.904057 13190 provider.cpp:1605] Failed to apply operation (uuid: 
3bd1a1a9-43d3-485c-9275-59cebd64b07c): Deadline Exceeded
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:11:27.904058 13192 status_update_manager_process.hpp:152] Received 
operation status update OPERATION_FAILED (Status UUID: 
8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 
3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 
'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 
'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:11:27.904331 13192 status_update_manager_process.hpp:929] 
Checkpointing UPDATE for operation status update OPERATION_FAILED (Status UUID: 
8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 
3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 
'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 
'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:11:27.947286 13189 slave.cpp:7696] Handling resource provider message 
'UPDATE_OPERATION_STATUS: (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for 
framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: 
OPERATION_FAILED, status update state: OPERATION_FAILED)'
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:11:27.947376 13189 slave.cpp:8034] Updating the state of operation 
'a1BdfrEhy4ZLSNPZbDrzp1h-0' (uuid: 3bd1a1a9-43d3-485c-9275-59cebd64b07c) for 
framework c0b7cc7e-db35-450d-bf25-9e3183a07161-0002 (latest state: 
OPERATION_FAILED, status update state: OPERATION_FAILED)
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:11:27.947407 13189 slave.cpp:7890] Forwarding status update of 
operation 'a1BdfrEhy4ZLSNPZbDrzp1h-0' (operation_uuid: 
3bd1a1a9-43d3-485c-9275-59cebd64b07c) for framework 
c0b7cc7e-db35-450d-bf25-9e3183a07161-0002
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:11:27.952689 13193 status_update_manager_process.hpp:252] Received 
operation status update acknowledgement (UUID: 
8c1ddad1-4adb-4df5-91fe-235d265a71d8) for stream 
3bd1a1a9-43d3-485c-9275-59cebd64b07c
Jan 09 11:11:27 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[13170]: 
I0109 11:11:27.952725 13193 status_update_manager_process.hpp:929] 
Checkpointing ACK for operation status update OPERATION_FAILED (Status UUID: 
8c1ddad1-4adb-4df5-91fe-235d265a71d8) for operation UUID 
3bd1a1a9-43d3-485c-9275-59cebd64b07c (framework-supplied ID 
'a1BdfrEhy4ZLSNPZbDrzp1h-0') of framework 
'c0b7cc7e-db35-450d-bf25-9e3183a07161-0002' on agent 
c0b7cc7e-db35-450d-bf25-9e3183a07161-S1
[jdefelice@ec101 DCOS-46889]$ grep -e 4acf1495-1a36-4939-a71b-75ca5aa73657 
agent.log
Jan 09 11:10:28 ip-10-10-0-28.us-west-2.compute.internal mesos-agent[131

[jira] [Commented] (MESOS-9523) Add per-framework allocatable resources matcher/filter.

2019-01-16 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16744105#comment-16744105
 ] 

James DeFelice commented on MESOS-9523:
---

Is there already a design that justifies support for complex expressions beyond 
"min_allocatable_resources"? If so, would you mind dropping a link in the 
description of this ticket?

> Add per-framework allocatable resources matcher/filter.
> ---
>
> Key: MESOS-9523
> URL: https://issues.apache.org/jira/browse/MESOS-9523
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Meng Zhu
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: mesosphere, storage
>
> Currently, Mesos has a single global flag `min_allocatable_resources` that 
> provides some control over the shape of the offer. But, being a global flag, 
> finding a one-size-fits-all shape is hard and less than ideal. It would be 
> great if frameworks could specify different shapes based on their needs. 
> In addition to extending this flag to be per-framework, this is also a good 
> opportunity to see if it can be more than `min_allocatable`, e.g. providing 
> more predicates such as max, (not) contains, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9586) mesos/mesos-centos nightly images should include development headers

2019-02-19 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9586:
-

 Summary: mesos/mesos-centos nightly images should include 
development headers
 Key: MESOS-9586
 URL: https://issues.apache.org/jira/browse/MESOS-9586
 Project: Mesos
  Issue Type: Improvement
Reporter: James DeFelice


The existing Mesos nightly images greatly simplify the process of tracking the 
Mesos master branch w/ integration tests. Our integration tests now have a new 
requirement: we'd like to build Mesos modules against the latest master 
nightlies. This is difficult with the current mesos/mesos-centos dockerhub 
images because they don't include the development headers. Ideally, these 
headers would be available in this (or a sibling) image.

 

[https://github.com/apache/mesos/blob/master/support/packaging/centos/build-docker-centos.sh#L25-L29]

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9590) Mesos CI sometimes, incorrectly, overwrites already-pushed mesos master nightly images with new images built from non-master branches.

2019-02-20 Thread James DeFelice (JIRA)
James DeFelice created MESOS-9590:
-

 Summary: Mesos CI sometimes, incorrectly, overwrites 
already-pushed mesos master nightly images with new images built from 
non-master branches.
 Key: MESOS-9590
 URL: https://issues.apache.org/jira/browse/MESOS-9590
 Project: Mesos
  Issue Type: Bug
Reporter: James DeFelice
Assignee: Jie Yu


I pulled image mesos/mesos-centos:master-2019-02-15 some time on the 15th and 
worked with it locally, on my laptop, for about a week. Part of that work 
included downloading the related mesos-xxx-devel.rpm from the same CI build 
that produced the image so that I could build 3rd party mesos modules from the 
master base image. The rpm was labeled as pre-1.8.0.

This worked great until I tried to repeat the work on another machine. The 
other machine pulled the "same" dockerhub image 
(mesos/mesos-centos:master-2019-02-15) which was somehow built with a 
mesos-xxx.rpm labeled as pre-1.7.2. I couldn't build my docker image using this 
strangely new base because the mesos-xxx-devel.rpm I had hardcoded into the 
dockerfile no longer aligned with the version of the mesos RPM that was 
shipping in the base image.

The base image had changed, such that the mesos RPM version went from 1.8.0 to 
1.7.2. This should never happen.

[~jieyu] investigated and found that the problem appears to happen at random. 
Current thinking is that one of the Mesos CI boxes uses a version of git that 
is too old, and that the CI scripts incorrectly ignore a git command failure: 
the git command fails because the git version is too old, and the script then 
ignores any failure from the command pipeline in which it runs. As a result, 
the "version" of the branch being built cannot be detected and defaults to 
master, overwriting *actual* master image builds.

[~jieyu] also wrote some patches, which I'll link here:

* https://reviews.apache.org/r/70024/
* https://reviews.apache.org/r/70025/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

