[jira] [Updated] (MESOS-1806) Substituting etcd for Zookeeper

2016-04-20 Thread Shuai Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuai Lin updated MESOS-1806:
-
Shepherd: Kapil Arya  (was: Benjamin Hindman)

> Substituting etcd for Zookeeper
> ---
>
> Key: MESOS-1806
> URL: https://issues.apache.org/jira/browse/MESOS-1806
> Project: Mesos
>  Issue Type: Task
>  Components: leader election
>Reporter: Ed Ropple
>Assignee: Shuai Lin
>Priority: Minor
>
>eropple: Could you also file a new JIRA for Mesos to drop ZK 
> in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
> that one.
> --
> Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5239) Persistent volume DockerContainerizer support assumes proper mount propagation setup on the host.

2016-04-20 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5239:
-

 Summary: Persistent volume DockerContainerizer support assumes 
proper mount propagation setup on the host.
 Key: MESOS-5239
 URL: https://issues.apache.org/jira/browse/MESOS-5239
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 0.28.1, 0.28.0
Reporter: Jie Yu


We recently added persistent volume support in DockerContainerizer 
(MESOS-3413). To understand the problem, we first need to understand how 
persistent volumes are supported in DockerContainerizer.

To support persistent volumes in DockerContainerizer, we bind mount persistent 
volumes under a container's sandbox ('container_path' has to be relative for 
persistent volumes). When the Docker container is launched, since we always add 
a volume (-v) for the sandbox, the persistent volumes will be bind mounted into 
the container as well (since Docker does a 'rbind').
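For illustration, the effect is roughly the following (all paths are hypothetical; Mesos mounts the volume into the sandbox, and Docker's 'rbind' of the sandbox carries it into the container):
{noformat}
# Mesos bind mounts the persistent volume under the container's sandbox...
mount --bind /var/lib/mesos/volumes/roles/web/id1 \
      /var/lib/mesos/slaves/S0/frameworks/F0/executors/E0/runs/R0/data
# ...then the sandbox volume (-v) added by the DockerContainerizer pulls it along.
docker run -v /var/lib/mesos/slaves/S0/frameworks/F0/executors/E0/runs/R0:/mnt/mesos/sandbox ...
{noformat}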

For the above to work, the Docker daemon must be able to see the persistent 
volume mounts that Mesos creates in the host mount table. This is not a 
problem if the Docker daemon itself is using the host mount namespace. 
However, on systemd-enabled systems, the Docker daemon runs in a separate 
mount namespace, and all mounts in that namespace are marked as slave mounts 
due to this 
[patch|https://github.com/docker/docker/commit/eb76cb2301fc883941bc4ca2d9ebc3a486ab8e0a].

In other words, for this to work, the parent mount of the agent's work_dir 
must be a shared mount when the Docker daemon starts. This is typically true 
on CentOS 7 and CoreOS, where all mounts are shared mounts by default.
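One way to check this on a given host (a sketch; the work_dir path is an assumption):
{noformat}
# Show the propagation mode of the mount containing the agent's work_dir.
findmnt -o TARGET,PROPAGATION --target /var/lib/mesos
# TARGET  PROPAGATION
# /       shared
{noformat}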

However, this causes an issue with the 'filesystem/linux' isolator. To 
understand why, I first need to show a typical problem when dealing with 
shared mounts. Let me explain using the following commands on a CentOS 7 
machine:
{noformat}
[root@core-dev run]# cat /proc/self/mountinfo
24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
[root@core-dev run]# mkdir /run/netns
[root@core-dev run]# mount --bind /run/netns /run/netns
[root@core-dev run]# cat /proc/self/mountinfo
24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs 
rw,seclabel,mode=755
[root@core-dev run]# ip netns add test
[root@core-dev run]# cat /proc/self/mountinfo
24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs 
rw,seclabel,mode=755
162 121 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc 
proc rw
163 24 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc 
proc rw
{noformat}

As you can see above, there are two entries for /run/netns/test in the mount 
table, which is unexpected and can confuse some systems. The reason is that 
when we create a self bind mount (/run/netns -> /run/netns), the mount is put 
into the same shared mount peer group (shared:22) as its parent (/run). Then, 
when another mount is created underneath it (/run/netns/test), that mount 
operation is propagated to all mounts in the same peer group (shared:22), 
resulting in an unexpected additional mount.

The reason we need a self bind mount in Mesos is that we sometimes need to 
make sure certain mounts are shared, so that mounts created underneath them 
propagate when a new mount namespace is created. However, on some systems, 
mounts are private by default (e.g., Ubuntu 14.04). In those cases, since we 
cannot change the system mounts, we do a self bind mount so that we can set 
its mount propagation to shared. For instance, the filesystem/linux isolator 
does a self bind mount on the agent's work_dir.

To avoid the self bind mount pitfall mentioned above, in the filesystem/linux 
isolator, after creating the mount we do a make-slave followed by a 
make-shared, so that the mount is in its own shared mount peer group. That 
way, mounts created underneath it will not be propagated back.
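The equivalent mount(8) sequence looks roughly like this (the work_dir path is a placeholder):
{noformat}
mount --bind /var/lib/mesos /var/lib/mesos   # self bind mount of the work_dir
mount --make-slave /var/lib/mesos            # leave the parent's peer group
mount --make-shared /var/lib/mesos           # start a fresh peer group of its own
{noformat}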

However, that operation breaks the assumption made by the persistent volume 
support in the DockerContainerizer. As a result, we're seeing problems with 
persistent volumes in the DockerContainerizer when the filesystem/linux 
isolator is turned on.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3402) mesos-execute does not support credentials

2016-04-20 Thread Tim Anderegg (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251106#comment-15251106
 ] 

Tim Anderegg commented on MESOS-3402:
-

Patch draft completed, awaiting review: https://reviews.apache.org/r/46469/

> mesos-execute does not support credentials
> --
>
> Key: MESOS-3402
> URL: https://issues.apache.org/jira/browse/MESOS-3402
> Project: Mesos
>  Issue Type: Bug
>Reporter: Evan Krall
>Assignee: Tim Anderegg
>
> mesos-execute does not appear to support passing credentials. This makes it 
> impossible to use on a cluster where framework authentication is required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5238) CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest

2016-04-20 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5238:
--
Fix Version/s: 0.29.0

> CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest
> -
>
> Key: MESOS-5238
> URL: https://issues.apache.org/jira/browse/MESOS-5238
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0, 0.28.1
> Environment: CentOS 7 + SSL, x86-64
>Reporter: Neil Conway
>  Labels: flaky, mesosphere
> Fix For: 0.29.0
>
> Attachments: 5238_check_failure.txt
>
>
> Observed on the Mesosphere internal CI:
> {noformat}
> [22:56:28]W: [Step 10/10] F0420 22:56:28.056788   629 
> containerizer.cpp:1634] Check failed: containers_.contains(containerId)
> {noformat}
> Complete test log will be attached as a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5238) CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest

2016-04-20 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5238:
--
Affects Version/s: 0.28.0
   0.28.1

> CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest
> -
>
> Key: MESOS-5238
> URL: https://issues.apache.org/jira/browse/MESOS-5238
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.0, 0.28.1
> Environment: CentOS 7 + SSL, x86-64
>Reporter: Neil Conway
>  Labels: flaky, mesosphere
> Attachments: 5238_check_failure.txt
>
>
> Observed on the Mesosphere internal CI:
> {noformat}
> [22:56:28]W: [Step 10/10] F0420 22:56:28.056788   629 
> containerizer.cpp:1634] Check failed: containers_.contains(containerId)
> {noformat}
> Complete test log will be attached as a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5238) CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest

2016-04-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251040#comment-15251040
 ] 

Jie Yu commented on MESOS-5238:
---

Never mind, I think this problem is related to [~gilbert]'s recent change. 
He'll comment on this ticket with his findings.

> CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest
> -
>
> Key: MESOS-5238
> URL: https://issues.apache.org/jira/browse/MESOS-5238
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7 + SSL, x86-64
>Reporter: Neil Conway
>  Labels: flaky, mesosphere
> Attachments: 5238_check_failure.txt
>
>
> Observed on the Mesosphere internal CI:
> {noformat}
> [22:56:28]W: [Step 10/10] F0420 22:56:28.056788   629 
> containerizer.cpp:1634] Check failed: containers_.contains(containerId)
> {noformat}
> Complete test log will be attached as a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5238) CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest

2016-04-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251018#comment-15251018
 ] 

Jie Yu commented on MESOS-5238:
---

[~jojy] Can you take a look?

> CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest
> -
>
> Key: MESOS-5238
> URL: https://issues.apache.org/jira/browse/MESOS-5238
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7 + SSL, x86-64
>Reporter: Neil Conway
>  Labels: flaky, mesosphere
> Attachments: 5238_check_failure.txt
>
>
> Observed on the Mesosphere internal CI:
> {noformat}
> [22:56:28]W: [Step 10/10] F0420 22:56:28.056788   629 
> containerizer.cpp:1634] Check failed: containers_.contains(containerId)
> {noformat}
> Complete test log will be attached as a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5238) CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest

2016-04-20 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5238:
---
Attachment: 5238_check_failure.txt

> CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest
> -
>
> Key: MESOS-5238
> URL: https://issues.apache.org/jira/browse/MESOS-5238
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7 + SSL, x86-64
>Reporter: Neil Conway
>  Labels: flaky, mesosphere
> Attachments: 5238_check_failure.txt
>
>
> Observed on the Mesosphere internal CI:
> {noformat}
> [22:56:28]W: [Step 10/10] F0420 22:56:28.056788   629 
> containerizer.cpp:1634] Check failed: containers_.contains(containerId)
> {noformat}
> Complete test log will be attached as a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5238) CHECK failure in AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest

2016-04-20 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5238:
--

 Summary: CHECK failure in 
AppcProvisionerIntegrationTest.ROOT_SimpleLinuxImageTest
 Key: MESOS-5238
 URL: https://issues.apache.org/jira/browse/MESOS-5238
 Project: Mesos
  Issue Type: Bug
 Environment: CentOS 7 + SSL, x86-64
Reporter: Neil Conway


Observed on the Mesosphere internal CI:

{noformat}
[22:56:28]W: [Step 10/10] F0420 22:56:28.056788   629 
containerizer.cpp:1634] Check failed: containers_.contains(containerId)
{noformat}

Complete test log will be attached as a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5013) Add docker volume driver isolator for Mesos containerizer.

2016-04-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250797#comment-15250797
 ] 

Jie Yu commented on MESOS-5013:
---

commit a823a7285e5e6becb198eef0fc8ae716b9e66126
Author: Guangya Liu 
Date:   Wed Apr 20 14:33:01 2016 -0700

Implemented create() for docker volume isolator.

Review: https://reviews.apache.org/r/46180/

> Add docker volume driver isolator for Mesos containerizer.
> --
>
> Key: MESOS-5013
> URL: https://issues.apache.org/jira/browse/MESOS-5013
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>  Labels: mesosphere
>
> The isolator will interact with Docker Volume Driver Plugins to mount and 
> unmount external volumes to container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5013) Add docker volume driver isolator for Mesos containerizer.

2016-04-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250748#comment-15250748
 ] 

Jie Yu commented on MESOS-5013:
---

commit d2d44d5e86cb2b76fcb2025b87031bc47456d035
Author: Guangya Liu 
Date:   Wed Apr 20 12:04:25 2016 -0700

Added docker volume driver client for mount and unmount.

Review: https://reviews.apache.org/r/45360/

> Add docker volume driver isolator for Mesos containerizer.
> --
>
> Key: MESOS-5013
> URL: https://issues.apache.org/jira/browse/MESOS-5013
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>  Labels: mesosphere
>
> The isolator will interact with Docker Volume Driver Plugins to mount and 
> unmount external volumes to container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5013) Add docker volume driver isolator for Mesos containerizer.

2016-04-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250508#comment-15250508
 ] 

Jie Yu commented on MESOS-5013:
---

commit 17c954ac1d395192d343eb62409d03e14b851e67
Author: Guangya Liu 
Date:   Wed Apr 20 11:18:53 2016 -0700

Added paths helper function for docker volume checkpoint.

Review: https://reviews.apache.org/r/46426

> Add docker volume driver isolator for Mesos containerizer.
> --
>
> Key: MESOS-5013
> URL: https://issues.apache.org/jira/browse/MESOS-5013
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>  Labels: mesosphere
>
> The isolator will interact with Docker Volume Driver Plugins to mount and 
> unmount external volumes to container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5013) Add docker volume driver isolator for Mesos containerizer.

2016-04-20 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250357#comment-15250357
 ] 

Jie Yu commented on MESOS-5013:
---

commit d02108fc59d0e6b6e6b2e072e84eb79da370fdc3
Author: Guangya Liu 
Date:   Wed Apr 20 10:04:37 2016 -0700

Added state protobuf for DockerVolume.

The DockerVolume is used to checkpoint volume information for each
container. It'll be used during recovery.

Review: https://reviews.apache.org/r/45270/

> Add docker volume driver isolator for Mesos containerizer.
> --
>
> Key: MESOS-5013
> URL: https://issues.apache.org/jira/browse/MESOS-5013
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>  Labels: mesosphere
>
> The isolator will interact with Docker Volume Driver Plugins to mount and 
> unmount external volumes to container.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5219) Add security headers to HTTP response

2016-04-20 Thread Don Laidlaw (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250345#comment-15250345
 ] 

Don Laidlaw commented on MESOS-5219:


From Wikipedia:
{quote}
Clickjacking is possible because seemingly harmless features of HTML web pages 
can be employed to perform unexpected actions.

A clickjacked page tricks a user into performing undesired actions by clicking 
on a concealed link. On a clickjacked page, the attackers load another page 
over it in a transparent layer. The users think that they are clicking visible 
buttons, while they are actually performing actions on the hidden/invisible 
page. The hidden page may be an authentic page; therefore, the attackers can 
trick users into performing actions which the users never intended. There is no 
way of tracing such actions to the attackers later, as the users would have 
been genuinely authenticated on the hidden page.
{quote}

The worst part about clickjacking is that the attack can happen even if the 
attacker does not have access to the server being attacked. If the user of the 
web browser can access the mesos host:port, then that is enough to allow a 
clickjacking attack.

There is some good information at OWASP about defending against clickjacking: 
[https://www.owasp.org/index.php/Clickjacking_Defense_Cheat_Sheet#Introduction]. 
The Wikipedia page also describes it very well: 
[https://en.wikipedia.org/wiki/Clickjacking]. Both document the X-Frame-Options 
solution.

I would recommend making the addition of the X-Frame-Options header to HTTP 
responses optional via a startup option. If the option is not provided, do 
not add the X-Frame-Options header; if it is provided, set the header to the 
value specified by the user.
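For illustration, a sketch of how that could look (the {{--http_frame_options}} flag name is purely hypothetical, used here only to illustrate the proposal):
{noformat}
# Hypothetical flag: only emit the header when the operator asks for it.
mesos-master --http_frame_options=SAMEORIGIN ...

# Verify that responses now carry the header.
curl -s -D - -o /dev/null http://master:5050/state.json | grep -i '^X-Frame-Options'
# X-Frame-Options: SAMEORIGIN
{noformat}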

The same is true for cross-site scripting. See 
[https://en.wikipedia.org/wiki/Cross-site_scripting] and 
[https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet]

> Add security headers to HTTP response
> -
>
> Key: MESOS-5219
> URL: https://issues.apache.org/jira/browse/MESOS-5219
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Don Laidlaw
>
> Cross-site scripting and clickjacking are major concerns. Many issues can be 
> resolved by setting some headers in the HTTP responses for the user interface 
> and REST responses of both the master and slave processes.
> X-Frame-Options: Can be set to deny, sameorigin, or allow-from 
> X-XSS-Protection: 1; mode=block
> These would go a long way toward making sites using Mesos more secure. Note 
> that the user exploiting these attacks does not need access to the Mesos 
> hosts; they are attacked through a user's web browser. So if a user can 
> connect to both Mesos and the internet, it is an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5210) Reliably unreserving dynamically reserved resources is unattainable.

2016-04-20 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250265#comment-15250265
 ] 

Yan Xu commented on MESOS-5210:
---

Hey, thanks [~neilc]! I wasn't aware of MESOS-3746 but was thinking in this 
direction as well. Will follow up with more thoughts.

> Reliably unreserving dynamically reserved resources is unattainable.
> 
>
> Key: MESOS-5210
> URL: https://issues.apache.org/jira/browse/MESOS-5210
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>
> To unreserve a resource, the current prescribed workflow is to:
> 1. Wait for an offer containing the reserved resource after the 
> scheduler/role is done using it.
> 2. Call unreserve on this resource/offer.
> 3. Done.
> However, this is not reliable if:
> 1. The master fails to receive the call: this results in the reserved 
> resources being offered to the role again; at least there is some signal in 
> this case.
> 2. The master processes the call but the slave fails to receive the 
> {{CheckpointResourcesMessage}} -> then if the master fails over, the slave 
> will reregister with the resource still reserved -> inconsistency here.
> 3. The master and slave have both processed the call and the resource is 
> unreserved, but there is no guarantee that the role unreserving it will 
> receive the offer back. Even if it receives an offer, if the reserved 
> resource is fungible it cannot distinguish a newly unreserved resource from 
> additional resource that has just been freed up.
> If the framework doesn't go away, it can facilitate the reconciliation, but 
> if it wants to terminate, the question is: when can it?
> The best strategy right now seems to be for the stopping framework to wait 
> (with a timeout) for the offer to come back after the unreserve call, as a 
> form of verification that is not bulletproof, and leave the rest to the 
> operator.
> We should improve the reliability of unreserve operations.
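As an aside, dynamic reservations can also be unreserved through the master's operator endpoint, which shares the same reliability caveats (a sketch; the principal, agent ID, and resource JSON are hypothetical):
{noformat}
curl -i -u operator:password \
     -d slaveId=<agent-id> \
     -d resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":1},"role":"ads","reservation":{"principal":"operator"}}]' \
     -X POST http://master:5050/master/unreserve
{noformat}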



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4992) sandbox uri does not work outside mesos http server

2016-04-20 Thread Kyle Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250261#comment-15250261
 ] 

Kyle Anderson commented on MESOS-4992:
--

I think this issue explains (or is the same as) why sandbox links don't work 
when I open them in new tabs.

> sandbox uri does not work outside mesos http server
> ---
>
> Key: MESOS-4992
> URL: https://issues.apache.org/jira/browse/MESOS-4992
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.27.1
>Reporter: Stavros Kontopoulos
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> The sandbox URI of a framework does not work if I just copy-paste it into 
> the browser.
> For example, the following sandbox URI:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/frameworks/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009/executors/driver-20160321155016-0001/browse
> should redirect to:
> http://172.17.0.1:5050/#/slaves/50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0/browse?path=%2Ftmp%2Fmesos%2Fslaves%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-S0%2Fframeworks%2F50f87c73-79ef-4f2a-95f0-b2b4062b2de6-0009%2Fexecutors%2Fdriver-20160321155016-0001%2Fruns%2F60533483-31fb-4353-987d-f3393911cc80
> yet it fails with the message:
> "Failed to find slaves.
> Navigate to the slave's sandbox via the Mesos UI."
> and redirects to:
> http://172.17.0.1:5050/#/
> It is an issue for me because I'm working on extending the Mesos Spark UI 
> with sandbox URIs. The other option is to get the slave info, parse the JSON 
> there, and extract the executor paths, which is not so straightforward or 
> elegant.
> Moreover, I don't see the runs/container_id in the Mesos proto API. I guess 
> this is hidden info; it is the piece of info needed to rewrite the URI 
> without redirection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

2016-04-20 Thread Priyanka Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250238#comment-15250238
 ] 

Priyanka Gupta commented on MESOS-5193:
---

[~neilc] : Any updates on this one?

> Recovery failed: Failed to recover registrar on reboot of mesos master
> --
>
> Key: MESOS-5193
> URL: https://issues.apache.org/jira/browse/MESOS-5193
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.22.0, 0.27.0
>Reporter: Priyanka Gupta
>  Labels: master, mesosphere
> Attachments: node1.log, node2.log, node3.log
>
>
> Hi all, 
> We are using a 3-node cluster with a mesos master, a mesos slave, and 
> zookeeper on each of them, with Chronos on top. The problem is that when we 
> reboot the mesos master leader, the other nodes try to get elected as leader 
> but fail with a registrar recovery error: 
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins"
> The next node then tries to become the leader but again fails with the same 
> error. I am not sure about the cause. We are currently using mesos 0.22 and 
> have also tried upgrading to mesos 0.27, but the problem continues to happen. 
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system?
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3059) Allow http endpoint to dynamically change the slave attributes

2016-04-20 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250056#comment-15250056
 ] 

Adam B commented on MESOS-3059:
---

[~klaus1982], please see the patch from [~xds2000] and the discussion in 
MESOS-1739, where [~vinodkone] previously offered to shepherd. These two 
issues are closely related. We need to be able to update attributes at all 
(even via an agent restart) before we can add an HTTP endpoint that does the 
same without restarting the agent.

> Allow http endpoint to dynamically change the slave attributes
> --
>
> Key: MESOS-3059
> URL: https://issues.apache.org/jira/browse/MESOS-3059
> Project: Mesos
>  Issue Type: Wish
>Reporter: Nitin
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> It is well understood that changing the attributes dynamically is not safe 
> without a restart, because the slave itself may not know which old framework 
> tasks running on it depended on the previous attributes. 
> However, a total restart deletes a lot of other history. We need to enable 
> dynamic attribute changes with a soft restart. 
> It would be good to expose a REST endpoint on either the slave or the 
> mesos-master that directly changes the state in ZooKeeper.
> USE-CASE
> We use slave attributes/roles to direct framework scheduling to specific 
> slaves per the framework's requirements. The Mesos scheduler only creates 
> offers on the basis of resources.
> In our use case, we categorize our Spark frameworks or jobs within a 
> framework (like Marathon) based on multiple factors. We want jobs or 
> frameworks belonging to one category to run in their own specific cluster of 
> resources. We want to dynamically manage the slaves in these logical 
> sub-clusters.
> Since the number of jobs that will be submitted, and when they will be 
> submitted, is very dynamic, it makes sense to be able to dynamically assign 
> roles or attributes to slaves. It is not possible to gauge the requirements 
> at the time of cluster provisioning. Static role or attribute assignment 
> leads to sub-optimal use of the cluster.
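For context, attributes can currently only be assigned statically at agent startup, e.g. (the attribute names below are hypothetical):
{noformat}
mesos-slave --master=zk://host:2181/mesos \
  --attributes="rack:r1;category:spark"
# Changing these requires restarting the agent, which is what this ticket wants to avoid.
{noformat}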



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-3059) Allow http endpoint to dynamically change the slave attributes

2016-04-20 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma reassigned MESOS-3059:
---

Assignee: Klaus Ma

> Allow http endpoint to dynamically change the slave attributes
> --
>
> Key: MESOS-3059
> URL: https://issues.apache.org/jira/browse/MESOS-3059
> Project: Mesos
>  Issue Type: Wish
>Reporter: Nitin
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> It is well understood that changing the attributes dynamically is not safe 
> without a restart, because the slave itself may not know which old framework 
> tasks running on it depended on the previous attributes. 
> However, a total restart deletes a lot of other history. We need to enable 
> dynamic attribute changes with a soft restart. 
> It would be good to expose a REST endpoint on either the slave or the 
> mesos-master that directly changes the state in ZooKeeper.
> USE-CASE
> We use slave attributes/roles to direct framework scheduling to specific 
> slaves per the framework's requirements. The Mesos scheduler only creates 
> offers on the basis of resources.
> In our use case, we categorize our Spark frameworks or jobs within a 
> framework (like Marathon) based on multiple factors. We want jobs or 
> frameworks belonging to one category to run in their own specific cluster of 
> resources. We want to dynamically manage the slaves in these logical 
> sub-clusters.
> Since the number of jobs that will be submitted, and when they will be 
> submitted, is very dynamic, it makes sense to be able to dynamically assign 
> roles or attributes to slaves. It is not possible to gauge the requirements 
> at the time of cluster provisioning. Static role or attribute assignment 
> leads to sub-optimal use of the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5167) Add tests for `network/cni` isolator

2016-04-20 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237322#comment-15237322
 ] 

Qian Zhang edited comment on MESOS-5167 at 4/20/16 2:42 PM:


Review chain:
https://reviews.apache.org/r/46096/
https://reviews.apache.org/r/46097/
https://reviews.apache.org/r/46435/
https://reviews.apache.org/r/46436/
https://reviews.apache.org/r/46438/


was (Author: qianzhang):
Review chain:
https://reviews.apache.org/r/46096/
https://reviews.apache.org/r/46097/

> Add tests for `network/cni` isolator
> 
>
> Key: MESOS-5167
> URL: https://issues.apache.org/jira/browse/MESOS-5167
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> We need to add tests to verify the functionality of `network/cni` isolator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5210) Reliably unreserving dynamically reserved resources is unattainable.

2016-04-20 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249989#comment-15249989
 ] 

Neil Conway commented on MESOS-5210:


I agree, this is an issue. In the past, we have talked about addressing the 
problem via some combination of (a) reservation IDs, (b) a "reconciliation" 
operation for reservations -- see MESOS-3746 and MESOS-3826.

> Reliably unreserving dynamically reserved resources is unattainable.
> 
>
> Key: MESOS-5210
> URL: https://issues.apache.org/jira/browse/MESOS-5210
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>
> To unreserve a resource, the current prescribed workflow is to:
> 1. Wait for an offer containing the reserved resource after the 
> scheduler/role is done using it.
> 2. Call unreserve on this resource/offer.
> 3. Done.
> However, this is not reliable if:
> 1. The master fails to receive the call: this results in the reserved 
> resources being offered to the role again; at least there is some signal in 
> this case.
> 2. The master processes the call but the slave fails to receive the 
> {{CheckpointResourcesMessage}} -> then if the master fails over, the slave 
> will reregister with the resource still reserved -> inconsistency here.
> 3. The master and slave have both processed the call and the resource is 
> unreserved, but there is no guarantee that the role unreserving it will 
> receive the offer back. Even if it receives an offer, if the reserved 
> resource is fungible it cannot distinguish a newly unreserved resource from 
> additional resource that has just been freed up.
> If the framework doesn't go away, it can facilitate the reconciliation, but 
> if it wants to terminate, the question is: when can it?
> The best strategy right now seems to be for the stopping framework to wait 
> (with a timeout) for the offer to come back after the unreserve call, as a 
> form of verification that is not bulletproof, and leave the rest to the 
> operator.
> We should improve the reliability of unreserve operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5219) Add security headers to HTTP response

2016-04-20 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249981#comment-15249981
 ] 

Neil Conway commented on MESOS-5219:


[~dlaidlaw] -- thanks for the report. I'm not very familiar with XSS attacks or 
click jacking -- can you describe a hypothetical scenario in which Mesos would 
be involved in such an attack, and how the headers you suggest adding would 
prevent the attack?

> Add security headers to HTTP response
> -
>
> Key: MESOS-5219
> URL: https://issues.apache.org/jira/browse/MESOS-5219
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Don Laidlaw
>
> Cross-site scripting and clickjacking are major concerns. Many issues can be 
> resolved by setting some headers in the HTTP responses for the user interface 
> and REST responses of both the master and slave processes.
> X-Frame-Options: Can be set to deny, sameorigin, or allow-from 
> X-XSS-Protection: 1; mode=block
> These would go a long way toward making sites using Mesos more secure. Note 
> that the user exploiting these attacks does not need access to the Mesos 
> hosts; they are attacked through a user's web browser. So if a user can 
> connect to both Mesos and the internet, it is an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5215) Update the documentation for '/reserve' and '/create-volumes'

2016-04-20 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249977#comment-15249977
 ] 

Neil Conway commented on MESOS-5215:


[~greggomann] -- should there be JIRAs linked to this ticket?

> Update the documentation for '/reserve' and '/create-volumes'
> -
>
> Key: MESOS-5215
> URL: https://issues.apache.org/jira/browse/MESOS-5215
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 0.28.1
>Reporter: Greg Mann
>  Labels: documentation, mesosphere
>
> There are a couple of issues related to the {{principal}} field in {{DiskInfo}} 
> and {{ReservationInfo}} (see linked JIRAs) that should be better documented. 
> We need to help users understand the purpose/significance of these fields, 
> and how to use them properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3059) Allow http endpoint to dynamically change the slave attributes

2016-04-20 Thread Aaron Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249903#comment-15249903
 ] 

Aaron Carey commented on MESOS-3059:


We also would love this feature!

> Allow http endpoint to dynamically change the slave attributes
> --
>
> Key: MESOS-3059
> URL: https://issues.apache.org/jira/browse/MESOS-3059
> Project: Mesos
>  Issue Type: Wish
>Reporter: Nitin
>  Labels: mesosphere
>
> It is well understood that changing the attributes dynamically is not safe 
> without a restart, because the slave itself may not know which old framework 
> tasks running on it depended on the previous attributes. 
> However, a total restart deletes a lot of other history. We need to enable 
> dynamic attribute changes with a soft restart. 
> It would be good to expose a REST endpoint on either the slave or the 
> mesos-master that directly changes the state in ZooKeeper.
> USE-CASE
> We use slave attributes/roles to direct framework scheduling to specific 
> slaves per the framework's requirements. The Mesos scheduler only creates 
> offers on the basis of resources.
> In our use case, we categorize our Spark frameworks or jobs within a 
> framework (like Marathon) based on multiple factors. We want jobs or 
> frameworks belonging to one category to run in their own specific cluster of 
> resources. We want to dynamically manage the slaves in these logical 
> sub-clusters.
> Since the number of jobs that will be submitted, and when they will be 
> submitted, is very dynamic, it makes sense to be able to dynamically assign 
> roles or attributes to slaves. It is not possible to gauge the requirements 
> at the time of cluster provisioning. Static role or attribute assignment 
> leads to sub-optimal use of the cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5227) Implement HTTP Docker Executor that uses the Executor Library

2016-04-20 Thread Yong Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yong Tang reassigned MESOS-5227:


Assignee: Yong Tang

> Implement HTTP Docker Executor that uses the Executor Library
> -
>
> Key: MESOS-5227
> URL: https://issues.apache.org/jira/browse/MESOS-5227
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Yong Tang
>
> Similar to what we did with the HTTP command executor in MESOS-3558, we should 
> have an HTTP Docker executor that can speak the v1 Executor API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-887) Scheduler driver should use exited() to detect disconnection with Master.

2016-04-20 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma reassigned MESOS-887:
--

Assignee: Klaus Ma

> Scheduler driver should use exited() to detect disconnection with Master.
> -
>
> Key: MESOS-887
> URL: https://issues.apache.org/jira/browse/MESOS-887
> Project: Mesos
>  Issue Type: Improvement
>  Components: framework, master
>Affects Versions: 0.13.0, 0.14.0, 0.14.1, 0.14.2, 0.15.0, 0.16.0
>Reporter: Benjamin Mahler
>Assignee: Klaus Ma
>  Labels: reliability, twitter
>
> The Scheduler Driver already links with the master, but it does not use the 
> built-in exited() notification from libprocess to detect socket closure.
> Of particular concern is that, if the socket breaks and subsequent messages 
> are successfully sent on ephemeral sockets, then we don't re-register with 
> the master. Messages may have been dropped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-4279) Graceful restart of docker task

2016-04-20 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249489#comment-15249489
 ] 

Martin Bydzovsky edited comment on MESOS-4279 at 4/20/16 8:31 AM:
--

Fine, so I will prepare the RB issues. Should I make two separate ones? Or just 
one fixing both problems? Or just the one fixing the corrupted stdout/err 
streams? Because in the second one, there's the philosophical issue of marking 
the task as KILLED vs FINISHED when it ends during the grace period (even 
though non-docker tasks end with FINISHED).. :) [~alexr] [~haosd...@gmail.com]?


was (Author: bydga):
Fine, so I will prepare the RB issues. Should I make two separate? Or just one 
fixing both problems? Or just the one fixing the corrupted stdout/err streams? 
Because in the second one, theres the philosophical issue about marking task as 
KILLED vs FINISHED when it end during the grace period (even though non-docker 
tasks are ending with FINISHED).. :) [~alexr] [~haosd...@gmail.com]?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts for our mesos-marathon-docker setup and I 
> ran into the following issue:
> (it was already discussed at 
> https://github.com/mesosphere/marathon/issues/2876, and the folks from 
> Mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result: the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately, without 
> a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4279) Graceful restart of docker task

2016-04-20 Thread Martin Bydzovsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249489#comment-15249489
 ] 

Martin Bydzovsky commented on MESOS-4279:
-

Fine, so I will prepare the RB issues. Should I make two separate ones? Or just 
one fixing both problems? Or just the one fixing the corrupted stdout/err 
streams? Because in the second one, there's the philosophical issue of marking 
the task as KILLED vs FINISHED when it ends during the grace period (even 
though non-docker tasks end with FINISHED).. :) [~alexr] [~haosd...@gmail.com]?

> Graceful restart of docker task
> ---
>
> Key: MESOS-4279
> URL: https://issues.apache.org/jira/browse/MESOS-4279
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0, 0.26.0, 0.27.2
>Reporter: Martin Bydzovsky
>Assignee: Qian Zhang
>  Labels: docker, mesosphere
>
> I'm implementing graceful restarts for our mesos-marathon-docker setup and I 
> ran into the following issue:
> (it was already discussed at 
> https://github.com/mesosphere/marathon/issues/2876, and the folks from 
> Mesosphere got to the point that it's probably a docker containerizer 
> problem...)
> To sum it up:
> When I deploy a simple Python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like
> {code:javascript}
> data = {
>   args: ["/tmp/script.py"],
>   instances: 1,
>   cpus: 0.1,
>   mem: 256,
>   id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result: the task receives SIGTERM 
> and dies peacefully (within my script-specified 2-second period).
> But when I wrap this Python script in a Docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application via Marathon:
> {code:javascript}
> data = {
>   args: ["./script.py"],
>   container: {
>   type: "DOCKER",
>   docker: {
>   image: "bydga/marathon-test-api"
>   },
>   forcePullImage: true
>   },
>   cpus: 0.1,
>   mem: 256,
>   instances: 1,
>   id: "marathon-test-api"
> }
> {code}
> During a restart (issued from Marathon), the task dies immediately, without 
> a chance to do any cleanup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4689) Design doc for v1 Operator API

2016-04-20 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249379#comment-15249379
 ] 

Kevin Klues commented on MESOS-4689:


We are still building it out.  I hope to have a draft in the next couple of 
days.

> Design doc for v1 Operator API
> --
>
> Key: MESOS-4689
> URL: https://issues.apache.org/jira/browse/MESOS-4689
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>Assignee: Kevin Klues
>
> We need to design how the v1 operator API (all the HTTP endpoints exposed by 
> master/agent that are not for scheduler/executor interactions) looks and 
> works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4689) Design doc for v1 Operator API

2016-04-20 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249375#comment-15249375
 ] 

Jay Guo commented on MESOS-4689:


Hi, is there a link to the doc? Thx!

> Design doc for v1 Operator API
> --
>
> Key: MESOS-4689
> URL: https://issues.apache.org/jira/browse/MESOS-4689
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>Assignee: Kevin Klues
>
> We need to design how the v1 operator API (all the HTTP endpoints exposed by 
> master/agent that are not for scheduler/executor interactions) looks and 
> works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5017) Don't consider agents without allocatable resources in the allocator

2016-04-20 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma updated MESOS-5017:

Description: 
During the review of r/43668/, an enhancement came up: if an agent has no 
allocatable resources, the allocator should filter it out at the 
beginning.

{quote}
Joris Van Remoortere Posted 1 month ago (March 16, 2016, 5:04 a.m.)
Should we filter out slaves that have no allocatable resources?

If we do, let's make sure we note that we want to pass the original slave IDs 
to the deallocate function.

Dario Rexin 4 weeks ago (March 23, 2016, 4:25 a.m.)
I'm not sure it would be a big improvement. Calculating the available 
resources is somewhat expensive, we have to do it again in the loop, and most 
slaves will probably have resources available anyway. The reason it's an 
improvement in the loop is that after we offer the resources to a framework, 
we can be sure that they are all unavailable to the following frameworks under 
the same role.

Klaus Ma 4 weeks ago (March 23, 2016, 11:13 a.m.)
@joris/dario, I think the improvement depends on the workload pattern: 1) for 
short-running tasks, several tasks may finish during the allocation interval, 
so there may be no improvement; 2) but for long-running tasks, the slave/agent 
should be fully used most of the time, so it would be a big improvement. I 
previously logged MESOS-4986 to add a filter after stage 1 (Quota), but it may 
be useless once revocable-by-default lands.

Joris Van Remoortere 3 weeks, 6 days ago (March 23, 2016, 8:59 p.m.)
Can you open a JIRA to consider doing this? Per Klaus' example, I'm not 
convinced this wouldn't have a large impact in certain scenarios.
{quote}

  was:In the DRFAllocator we don't check if an agent has allocatable resources 
until we try to offer the resources to a framework. We could already filter 
them out at the beginning. 


> Don't consider agents without allocatable resources in the allocator
> 
>
> Key: MESOS-5017
> URL: https://issues.apache.org/jira/browse/MESOS-5017
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Dario Rexin
>Assignee: Klaus Ma
>Priority: Minor
>
> During the review of r/43668/, an enhancement came up: if an agent has no 
> allocatable resources, the allocator should filter it out at the 
> beginning.
> {quote}
> Joris Van Remoortere Posted 1 month ago (March 16, 2016, 5:04 a.m.)
> Should we filter out slaves that have no allocatable resources?
> If we do, let's make sure we note that we want to pass the original slave IDs 
> to the deallocate function.
> Dario Rexin 4 weeks ago (March 23, 2016, 4:25 a.m.)
> I'm not sure it would be a big improvement. Calculating the available 
> resources is somewhat expensive, we have to do it again in the loop, and most 
> slaves will probably have resources available anyway. The reason it's an 
> improvement in the loop is that after we offer the resources to a framework, 
> we can be sure that they are all unavailable to the following frameworks 
> under the same role.
> Klaus Ma 4 weeks ago (March 23, 2016, 11:13 a.m.)
> @joris/dario, I think the improvement depends on the workload pattern: 1) 
> for short-running tasks, several tasks may finish during the allocation 
> interval, so there may be no improvement; 2) but for long-running tasks, the 
> slave/agent should be fully used most of the time, so it would be a big 
> improvement. I previously logged MESOS-4986 to add a filter after stage 1 
> (Quota), but it may be useless once revocable-by-default lands.
> Joris Van Remoortere 3 weeks, 6 days ago (March 23, 2016, 8:59 p.m.)
> Can you open a JIRA to consider doing this? Per Klaus' example, I'm not 
> convinced this wouldn't have a large impact in certain scenarios.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5060) Requesting /files/read.json with a negative length value causes subsequent /files requests to 404.

2016-04-20 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249323#comment-15249323
 ] 

zhou xing commented on MESOS-5060:
--

[~greggomann], sure, I have updated the review request to fix only the length 
issue. Do you think we need to open a new ticket to track fixing those 
parameters' logic?

> Requesting /files/read.json with a negative length value causes subsequent 
> /files requests to 404.
> --
>
> Key: MESOS-5060
> URL: https://issues.apache.org/jira/browse/MESOS-5060
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.23.0
> Environment: Mesos 0.23.0 on CentOS 6, also Mesos 0.28.0 on OSX
>Reporter: Tom Petr
>Assignee: zhou xing
>Priority: Minor
> Fix For: 0.29.0
>
>
> I accidentally hit a slave's /files/read.json endpoint with a negative length 
> (ex. http://hostname:5051/files/read.json?path=XXX&offset=0&length=-100). The 
> HTTP request timed out after 30 seconds with nothing relevant in the slave 
> logs, and subsequent calls to any of the /files endpoints on that slave 
> immediately returned a HTTP 404 response. We ultimately got things working 
> again by restarting the mesos-slave process (checkpointing FTW!), but it'd be 
> wise to guard against negative lengths on the slave's end too.
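A reproduction sketch (the hostname and path are placeholders):
{noformat}
# The request with a negative length hangs for ~30 seconds...
curl --max-time 35 "http://hostname:5051/files/read.json?path=/tmp/foo&offset=0&length=-100"

# ...and afterwards any /files endpoint on that slave returns 404.
curl -i "http://hostname:5051/files/browse.json?path=/tmp/foo"
{noformat}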



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)