[jira] [Commented] (MESOS-5295) The task launched by non-checkpointed HTTP command executor will keep running till executor shutdown grace period (5s) after agent process exits.

2016-06-28 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354604#comment-15354604
 ] 

Qian Zhang commented on MESOS-5295:
---

RR: https://reviews.apache.org/r/49354/

> The task launched by non-checkpointed HTTP command executor will keep running 
> till executor shutdown grace period (5s) after agent process exits.
> -
>
> Key: MESOS-5295
> URL: https://issues.apache.org/jira/browse/MESOS-5295
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> When I tested the HTTP command executor, I found an issue. Here are my steps:
> 1. A framework which has no checkpointing enabled launches a long-running task 
> (e.g., sleep 1000).
> 2. After the task is running, kill the agent.
>  
> Then I see that the HTTP command executor terminates after 5s 
> ("DEFAULT_EXECUTOR_SHUTDOWN_GRACE_PERIOD"), but the task keeps running. 
> This behavior is not consistent with the driver-based command executor: after 
> the agent is killed, that executor will kill the task and then terminate 
> itself after 1s (there is an "os::sleep(Seconds(1));" in "reaped()").
> The root cause of this difference is that, for the driver-based command 
> executor, when the driver finds the agent is down, it calls executor->shutdown() 
> (https://github.com/apache/mesos/blob/0.28.1/src/exec/exec.cpp#L487), so the 
> executor kills the task and then terminates itself. But for the HTTP command 
> executor, its "disconnected()" is called 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L388) 
> when the agent is down, and currently we do not do anything in 
> "disconnected()", so the task keeps running and the executor is 
> killed after 5s 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L623).
> The behavior of the driver-based command executor is correct; we need to make 
> sure the HTTP command executor kills the task when the agent is down if 
> checkpointing is not enabled.
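
For illustration, a minimal sketch of the direction such a fix could take. This is hedged: the member names ({{checkpoint}}, {{launched}}, {{taskId}}) and the {{kill()}} helper are assumptions for this sketch; the review above (r/49354) is the actual patch.

{code}
// Hedged sketch, not the actual patch: mirror the driver-based
// executor by killing the task on disconnect when the framework has
// no checkpointing, since the agent will never recover this executor.
void CommandExecutor::disconnected()
{
  if (!checkpoint && launched) {
    // `kill()` stands in for whatever sends SIGTERM/SIGKILL to the
    // task and then terminates the executor itself.
    kill(taskId);
  }
}
{code}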



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5295) The task launched by non-checkpointed HTTP command executor will keep running till executor shutdown grace period (5s) after agent process exits.

2016-06-28 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354585#comment-15354585
 ] 

Qian Zhang commented on MESOS-5295:
---

[~jvanz], sorry for the late reply. Yes, I am working on this and will post the 
patch soon :-)

> The task launched by non-checkpointed HTTP command executor will keep running 
> till executor shutdown grace period (5s) after agent process exits.
> -
>
> Key: MESOS-5295
> URL: https://issues.apache.org/jira/browse/MESOS-5295
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> When I tested the HTTP command executor, I found an issue. Here are my steps:
> 1. A framework which has no checkpointing enabled launches a long-running task 
> (e.g., sleep 1000).
> 2. After the task is running, kill the agent.
>  
> Then I see that the HTTP command executor terminates after 5s 
> ("DEFAULT_EXECUTOR_SHUTDOWN_GRACE_PERIOD"), but the task keeps running. 
> This behavior is not consistent with the driver-based command executor: after 
> the agent is killed, that executor will kill the task and then terminate 
> itself after 1s (there is an "os::sleep(Seconds(1));" in "reaped()").
> The root cause of this difference is that, for the driver-based command 
> executor, when the driver finds the agent is down, it calls executor->shutdown() 
> (https://github.com/apache/mesos/blob/0.28.1/src/exec/exec.cpp#L487), so the 
> executor kills the task and then terminates itself. But for the HTTP command 
> executor, its "disconnected()" is called 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L388) 
> when the agent is down, and currently we do not do anything in 
> "disconnected()", so the task keeps running and the executor is 
> killed after 5s 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L623).
> The behavior of the driver-based command executor is correct; we need to make 
> sure the HTTP command executor kills the task when the agent is down if 
> checkpointing is not enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5705) ZK credential is exposed in /flags and /state

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5705:
--
Priority: Blocker  (was: Critical)

> ZK credential is exposed in /flags and /state
> -
>
> Key: MESOS-5705
> URL: https://issues.apache.org/jira/browse/MESOS-5705
> Project: Mesos
>  Issue Type: Task
>  Components: master, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Blocker
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Mesos allows zk credentials to be embedded in the zk url, but exposes these 
> credentials in the /flags and /state endpoint. Even though /state is 
> authorized, it only filters out frameworks/tasks, so the top-level flags are 
> shown to any authenticated user.
> "zk": "zk://dcos_mesos_master:my_secret_password@127.0.0.1:2181/mesos",
> We need to find some way to hide this data, or even add a first-class 
> VIEW_FLAGS acl that applies to any endpoint that exposes flags.
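
For illustration, a hedged sketch of the masking approach (purely illustrative, not a proposed patch): redact the credential portion of the zk flag before any endpoint renders it.

{code}
#include <string>

// Illustrative only: turn "zk://user:pass@host:port/path" into
// "zk://host:port/path" before echoing flags out of /flags or /state.
std::string stripZkCredentials(const std::string& zk)
{
  const size_t scheme = zk.find("://");
  const size_t at = zk.find('@');

  if (scheme == std::string::npos || at == std::string::npos || at < scheme) {
    return zk; // No embedded credentials to strip.
  }

  return zk.substr(0, scheme + 3) + zk.substr(at + 1);
}
{code}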



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347587#comment-15347587
 ] 

Joseph Wu edited comment on MESOS-5576 at 6/29/16 2:27 AM:
---

After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

Review based on [MESOS-5740]: https://reviews.apache.org/r/49346/


was (Author: kaysoky):
After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

See: [MESOS-5740]

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494].
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5740) Consider adding `relink` functionality to libprocess

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354394#comment-15354394
 ] 

Joseph Wu edited comment on MESOS-5740 at 6/29/16 2:27 AM:
---

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |


was (Author: kaysoky):
|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |
| https://reviews.apache.org/r/49346/ | Network.hpp + relink |

> Consider adding `relink` functionality to libprocess
> 
>
> Key: MESOS-5740
> URL: https://issues.apache.org/jira/browse/MESOS-5740
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: libprocess, mesosphere
>
> Currently we don't have {{relink}} functionality in libprocess, i.e., a 
> way to create a new persistent connection between actors even if a 
> connection already exists.
> This can benefit us in a couple of ways:
> - The application may have more information on the state of a connection than 
> libprocess does, as libprocess only checks if the connection is alive or not. 
>  For example, a linkee may accept a connection, then fork, pass the 
> connection to a child, and subsequently exit.  As the connection is still 
> active, libprocess may not detect the exit.
> - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to 
> the remote instance being unavailable (e.g., partition, network 
> intermediaries not sending RST's etc). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347587#comment-15347587
 ] 

Joseph Wu edited comment on MESOS-5576 at 6/29/16 2:26 AM:
---

After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

See: [MESOS-5740]


was (Author: kaysoky):
After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |
| https://reviews.apache.org/r/49346/ | Network.hpp + relink |

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494].
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5740) Consider adding `relink` functionality to libprocess

2016-06-28 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5740:


 Summary: Consider adding `relink` functionality to libprocess
 Key: MESOS-5740
 URL: https://issues.apache.org/jira/browse/MESOS-5740
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Joseph Wu
Assignee: Joseph Wu


Currently we don't have {{relink}} functionality in libprocess, i.e., a way 
to create a new persistent connection between actors even if a connection 
already exists.

This can benefit us in a couple of ways:
- The application may have more information on the state of a connection than 
libprocess does, as libprocess only checks if the connection is alive or not.  
For example, a linkee may accept a connection, then fork, pass the connection 
to a child, and subsequently exit.  As the connection is still active, 
libprocess may not detect the exit.
- Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to 
the remote instance being unavailable (e.g., partition, network intermediaries 
not sending RST's etc). 
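
For a sense of the shape this could take, a hedged sketch; the actual signature in the reviews tracked by this ticket may differ.

{code}
// Hedged sketch of a relink-capable link(): today link(pid) reuses an
// existing persistent socket if one is present; a "reconnect" mode
// would tear it down and establish a fresh one.
enum class RemoteConnection
{
  REUSE,     // Current behavior: keep any existing link.
  RECONNECT  // Force a new persistent socket to the remote.
};

UPID link(const UPID& pid,
          const RemoteConnection remote = RemoteConnection::REUSE);

// Usage: after suspecting a half-dead socket (e.g., post-partition):
//   link(other, RemoteConnection::RECONNECT);
{code}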




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5735) Update WebUI to use v1 operator API

2016-06-28 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354391#comment-15354391
 ] 

zhou xing commented on MESOS-5735:
--

Given that the response of the v1 operator API is different from that of the 
previous HTTP endpoints, let me first go through the endpoints that the WebUI 
is using now, and then change them one by one.

> Update WebUI to use v1 operator API
> ---
>
> Key: MESOS-5735
> URL: https://issues.apache.org/jira/browse/MESOS-5735
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: zhou xing
>
> Having the WebUI use the v1 API would be a good validation of its usefulness 
> and correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5735) Update WebUI to use v1 operator API

2016-06-28 Thread zhou xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhou xing reassigned MESOS-5735:


Assignee: zhou xing

> Update WebUI to use v1 operator API
> ---
>
> Key: MESOS-5735
> URL: https://issues.apache.org/jira/browse/MESOS-5735
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: zhou xing
>
> Having the WebUI use the v1 API would be a good validation of its usefulness 
> and correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4092) Try to re-establish connection on ping timeouts with agent before removing it

2016-06-28 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354210#comment-15354210
 ] 

Benjamin Mahler commented on MESOS-4092:


FYI [~idownes], as part of MESOS-5576, we added the ability to force a 
reconnection during link:
https://reviews.apache.org/r/49177/

> Try to re-establish connection on ping timeouts with agent before removing it
> -
>
> Key: MESOS-4092
> URL: https://issues.apache.org/jira/browse/MESOS-4092
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after 
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}. 
> This can occur because of transient network failures, e.g., gray failures of 
> a switch uplink exhibiting heavy or total packet loss. Some network 
> architectures are designed to tolerate such gray failures and support 
> multiple paths between hosts. This can be implemented with equal-cost 
> multi-path routing (ECMP) where flows are hashed by their 5-tuple to multiple 
> possible uplinks. In such networks re-establishing a TCP connection will 
> almost certainly use a new source port and thus will likely be hashed to a 
> different uplink, avoiding the failed uplink and re-establishing connectivity 
> with the agent.
> After failing to receive pongs the SlaveObserver should next try to 
> re-establish a TCP connection (with exponential back-off) before declaring 
> the agent as lost. This can avoid significant disruption where large numbers 
> of agents reached through a single failed link could be removed unnecessarily 
> while still ensuring that agents that are truly lost are recognized as such.
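
A minimal sketch of the suggested retry loop follows. It is illustrative only; {{connectToAgent()}} is a hypothetical helper that opens a brand-new TCP connection, and the concrete back-off values are placeholders.

{code}
#include <chrono>
#include <thread>

bool connectToAgent();  // Hypothetical: opens a new socket to the agent.

// Illustrative only: retry fresh connects with exponential back-off
// before declaring the agent lost. Each fresh connect uses a new
// ephemeral source port, so ECMP will likely hash the flow onto a
// different (hopefully healthy) uplink.
bool tryReconnect(int attempts)
{
  std::chrono::milliseconds backoff(500);

  for (int i = 0; i < attempts; i++) {
    if (connectToAgent()) {
      return true;
    }
    std::this_thread::sleep_for(backoff);
    backoff *= 2;  // Exponential back-off between attempts.
  }

  return false;  // Only now treat the agent as lost.
}
{code}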



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5739) Enhance Value parsing

2016-06-28 Thread Klaus Ma (JIRA)
Klaus Ma created MESOS-5739:
---

 Summary: Enhance Value parsing
 Key: MESOS-5739
 URL: https://issues.apache.org/jira/browse/MESOS-5739
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Klaus Ma
Assignee: Klaus Ma


Enhanced Value parsing:

{code}
1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4].
2. Did not support {a{b, c}d} as Set; it should be {ab, cd}
3. Add check for Text against [a-zA-Z0-9_/.-]
{code}
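
For concreteness, hedged examples of the intended accept/reject behavior, written against the public {{Resources::parse}} API (which sits on top of Value parsing). Treating the rejections below as errors is the intent of this ticket, not necessarily current behavior.

{code}
#include <mesos/resources.hpp>

#include <stout/try.hpp>

using mesos::Resources;

// Well-formed Ranges and Set values parse:
Try<Resources> ok = Resources::parse("ports:[1-2, 3-4];names:{ab, cd}");

// After this change, nested brackets/braces should be rejected:
Try<Resources> bad1 = Resources::parse("ports:[1-2, [3-4]]");  // Error.
Try<Resources> bad2 = Resources::parse("names:{a{b, c}d}");    // Error.
{code}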




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5739) Enhance Value parsing

2016-06-28 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353941#comment-15353941
 ] 

Klaus Ma commented on MESOS-5739:
-

https://reviews.apache.org/r/49223/

> Enhance Value parsing
> -
>
> Key: MESOS-5739
> URL: https://issues.apache.org/jira/browse/MESOS-5739
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>
> Enhanced Value parsing:
> {code}
> 1. Did not support [1-2, [3-4]] as Ranges; it should be [1-2, 3-4].
> 2. Did not support {a{b, c}d} as Set; it should be {ab, cd}
> 3. Add check for Text against [a-zA-Z0-9_/.-]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5364) Consider adding `unlink` functionality to libprocess

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353931#comment-15353931
 ] 

Joseph Wu commented on MESOS-5364:
--

Note: Due to a similar issue with stale links, we will be introducing "relink" 
semantics to libprocess.  Relinking provides better guarantees than "unlinking" 
because the application is guaranteed to have a new socket connection, 
regardless of other linkers.

Here's an example of how relink is used: https://reviews.apache.org/r/49346/
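
Roughly, the usage looks like this. This is a hedged sketch of the idea behind r/49346, not the review itself; {{memberPids}} and the enum value are assumptions.

{code}
// Hedged sketch: when the ZK group watcher reports membership, relink
// rather than link, so a socket left half-dead by a partition is
// replaced instead of reused.
foreach (const process::UPID& pid, memberPids) {
  link(pid, RemoteConnection::RECONNECT);
}
{code}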

> Consider adding `unlink` functionality to libprocess
> 
>
> Key: MESOS-5364
> URL: https://issues.apache.org/jira/browse/MESOS-5364
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Anand Mazumdar
>  Labels: libprocess, mesosphere
>
> Currently we don't have {{unlink}} functionality in libprocess, i.e., the 
> equivalent of Erlang's http://erlang.org/doc/man/erlang.html#unlink-1. We 
> have a lot of places in our current code with TODOs for implementing it.
> It can benefit us in a couple of ways:
> - Based on the business logic of the actor, it would want to authoritatively 
> communicate that it is no longer interested in {{ExitedEvent}} for the 
> external remote link.
> - Sometimes, the {{ExitedEvent}} might be delayed or might be dropped due to 
> the remote instance being unavailable (e.g., partition, network 
> intermediaries not sending RST's etc). 
> I did not find any old JIRAs pertaining to this, but I did come across an 
> initial attempt to add it, albeit for injecting {{exited}} events, as 
> part of the initial review for MESOS-1059.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-28 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347587#comment-15347587
 ] 

Joseph Wu edited comment on MESOS-5576 at 6/28/16 11:33 PM:


After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49177/ | Implement "relink" |
| https://reviews.apache.org/r/49346/ | Network.hpp + relink |


was (Author: kaysoky):
After a discussion with [~benjaminhindman], [~bmahler], and [~jieyu], we 
determined that {{unlink}} semantics are not adequate when the application 
level knows about a broken socket (while libprocess does not).  Instead, the 
option to "relink" is preferable, as this should create a new persistent 
socket, without regard to how other processes are interacting inside 
libprocess.

|| Review || Summary ||
| https://reviews.apache.org/r/49174/ | Test-only libprocess hook |
| https://reviews.apache.org/r/49175/ | Tests + repro |
| https://reviews.apache.org/r/49176/ | Network::remove unused |
| https://reviews.apache.org/r/49177/ | Implement "relink" |

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster 
> by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost 
> leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to 
> leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | 
> Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | 
> Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | 
> Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to 
> leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || 
> Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494].
> Instead, we see a log line in Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5737) Expose Executor PID in containers endpoint

2016-06-28 Thread Haris Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haris Choudhary reassigned MESOS-5737:
--

Assignee: Haris Choudhary

> Expose Executor PID in containers endpoint
> --
>
> Key: MESOS-5737
> URL: https://issues.apache.org/jira/browse/MESOS-5737
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Haris Choudhary
>Assignee: Haris Choudhary
>Priority: Minor
>
> In order to greatly simplify the implementation for the redesigned Mesos 
> CLI's container plugin, we need the executor PID (Process ID) to be exposed 
> in the /containers endpoint. [Mesos CLI Epic | 
> https://issues.apache.org/jira/browse/MESOS-5676]
> This change will introduce the pid for an executor if it was launched by the 
> mesos containerizer.
> We need this PID for setns() calls to enter the container namespace for 
> commands such as container execute
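
For illustration, a hedged sketch of how a CLI could consume the exposed PID (not the plugin implementation itself); this is standard Linux namespace plumbing.

{code}
// Illustrative only: join the executor's mount namespace given its
// PID (requires CAP_SYS_ADMIN).
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int enterMountNamespace(pid_t pid)
{
  char path[64];
  snprintf(path, sizeof(path), "/proc/%d/ns/mnt", (int) pid);

  int fd = open(path, O_RDONLY);
  if (fd == -1) {
    return -1;
  }

  int result = setns(fd, CLONE_NEWNS);  // Enter the target namespace.
  close(fd);
  return result;
}
{code}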



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5737) Expose Executor PID in containers endpoint

2016-06-28 Thread Haris Choudhary (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353918#comment-15353918
 ] 

Haris Choudhary commented on MESOS-5737:


Updated the description. And PID refers to process ID, not the libprocess pid.

Thanks!

> Expose Executor PID in containers endpoint
> --
>
> Key: MESOS-5737
> URL: https://issues.apache.org/jira/browse/MESOS-5737
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Haris Choudhary
>Priority: Minor
>
> In order to greatly simplify the implementation for the redesigned Mesos 
> CLI's container plugin, we need the executor PID (Process ID) to be exposed 
> in the /containers endpoint. [Mesos CLI Epic | 
> https://issues.apache.org/jira/browse/MESOS-5676]
> This change will introduce the pid for an executor if it was launched by the 
> mesos containerizer.
> We need this PID for setns() calls to enter the container namespace for 
> commands such as container execute



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5737) Expose Executor PID in containers endpoint

2016-06-28 Thread Haris Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haris Choudhary updated MESOS-5737:
---
Description: 
In order to greatly simplify the implementation for the redesigned Mesos CLI's 
container plugin, we need the executor PID (Process ID) to be exposed in the 
/containers endpoint. [Mesos CLI Epic | 
https://issues.apache.org/jira/browse/MESOS-5676]

This change will introduce the pid for an executor if it was launched by the 
mesos containerizer.

We need this PID for setns() calls to enter the container namespace for 
commands such as container execute

  was:
In order to greatly simplify the implementation for the redesigned Mesos CLI's 
container plugin, we need the executor pid to be exposed in the /containers 
endpoint. [Mesos CLI Epic | https://issues.apache.org/jira/browse/MESOS-5676]

This change will introduce the pid for an executor if it was launched by the 
mesos containerizer.


> Expose Executor PID in containers endpoint
> --
>
> Key: MESOS-5737
> URL: https://issues.apache.org/jira/browse/MESOS-5737
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Haris Choudhary
>Priority: Minor
>
> In order to greatly simplify the implementation for the redesigned Mesos 
> CLI's container plugin, we need the executor PID (Process ID) to be exposed 
> in the /containers endpoint. [Mesos CLI Epic | 
> https://issues.apache.org/jira/browse/MESOS-5676]
> This change will introduce the pid for an executor if it was launched by the 
> mesos containerizer.
> We need this PID for setns() calls to enter the container namespace for 
> commands such as container execute



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5737) Expose Executor PID in containers endpoint

2016-06-28 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353902#comment-15353902
 ] 

Anand Mazumdar commented on MESOS-5737:
---

It is not quite clear from the description what we would be doing with 
the {{PID}} value. Also, HTTP-based executors don't have a PID.

> Expose Executor PID in containers endpoint
> --
>
> Key: MESOS-5737
> URL: https://issues.apache.org/jira/browse/MESOS-5737
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Haris Choudhary
>Priority: Minor
>
> In order to greatly simplify the implementation for the redesigned Mesos 
> CLI's container plugin, we need the executor pid to be exposed in the 
> /containers endpoint. [Mesos CLI Epic | 
> https://issues.apache.org/jira/browse/MESOS-5676]
> This change will introduce the pid for an executor if it was launched by the 
> mesos containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5738) Consider adding a CHECK_NOTERROR to stout/check.hpp.

2016-06-28 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5738:
-

 Summary: Consider adding a CHECK_NOTERROR to stout/check.hpp.
 Key: MESOS-5738
 URL: https://issues.apache.org/jira/browse/MESOS-5738
 Project: Mesos
  Issue Type: Wish
  Components: stout
Reporter: Jie Yu


This would be similar to CHECK_NOTNULL, which returns the actual object.

For instance, we use the following pattern in many places in the code:
{code}
string value = strings::format(...).get();
{code}

It would be ideal if we could do the following, so that we get better 
diagnostic messages when the check fails:
{code}
string value = CHECK_NOTERROR(strings::format(...));
{code} 
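
A hedged sketch of what such a macro could look like; the eventual stout implementation may well differ.

{code}
#include <glog/logging.h>

#include <stout/try.hpp>

// Like CHECK_NOTNULL: fatally logs the error (with file/line) if the
// Try holds one, otherwise unwraps and returns the value.
template <typename T>
T _check_noterror(const char* file, int line, const Try<T>& t)
{
  if (t.isError()) {
    google::LogMessageFatal(file, line).stream()
      << "Check failed: " << t.error();
  }
  return t.get();
}

#define CHECK_NOTERROR(expression) \
  _check_noterror(__FILE__, __LINE__, (expression))
{code}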



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5737) Expose Executor PID in containers endpoint

2016-06-28 Thread Haris Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haris Choudhary updated MESOS-5737:
---
Description: 
In order to greatly simplify the implementation for the redesigned Mesos CLI's 
container plugin, we need the executor pid to be exposed in the /containers 
endpoint. [Mesos CLI Epic | https://issues.apache.org/jira/browse/MESOS-5676]

This change will introduce the pid for an executor if it was launched by the 
mesos containerizer.

  was:
In order to greatly simplify the implementation for the redesigned Mesos CLI's 
container plugin, we need the executor pid to be exposed in the /containers 
endpoint.

This change will introduce the pid for an executor if it was launched by the 
mesos containerizer.


> Expose Executor PID in containers endpoint
> --
>
> Key: MESOS-5737
> URL: https://issues.apache.org/jira/browse/MESOS-5737
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Haris Choudhary
>Priority: Minor
>
> In order to greatly simplify the implementation for the redesigned Mesos 
> CLI's container plugin, we need the executor pid to be exposed in the 
> /containers endpoint. [Mesos CLI Epic | 
> https://issues.apache.org/jira/browse/MESOS-5676]
> This change will introduce the pid for an executor if it was launched by the 
> mesos containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5737) Expose Executor PID in containers endpoint

2016-06-28 Thread Haris Choudhary (JIRA)
Haris Choudhary created MESOS-5737:
--

 Summary: Expose Executor PID in containers endpoint
 Key: MESOS-5737
 URL: https://issues.apache.org/jira/browse/MESOS-5737
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Haris Choudhary
Priority: Minor


In order to greatly simplify the implementation for the redesigned Mesos CLI's 
container plugin, we need the executor pid to be exposed in the /containers 
endpoint.

This change will introduce the pid for an executor if it was launched by the 
mesos containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4778) Add appc/runtime isolator for runtime isolation for appc images.

2016-06-28 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353784#comment-15353784
 ] 

Jie Yu commented on MESOS-4778:
---

commit c5b118e6067b7d96df6f0b4bd538f71ac48e8bd4
Author: Srinivas Brahmaroutu 
Date:   Tue Jun 28 15:05:54 2016 -0700

Added proto message definitions to support appc runtime.

Review: https://reviews.apache.org/r/49207/

> Add appc/runtime isolator for runtime isolation for appc images.
> 
>
> Key: MESOS-4778
> URL: https://issues.apache.org/jira/browse/MESOS-4778
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Srinivas
>  Labels: containerizer, isolator
>
> Appc image also contains runtime information like 'exec', 'env', 
> 'workingDirectory' etc.
> https://github.com/appc/spec/blob/master/spec/aci.md
> Similar to docker images, we need to support a subset of them (mainly 'exec', 
> 'env' and 'workingDirectory').



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5727) Command executor health check does not work when the task specifies container image.

2016-06-28 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5727:
--
Shepherd: Jie Yu

> Command executor health check does not work when the task specifies container 
> image.
> 
>
> Key: MESOS-5727
> URL: https://issues.apache.org/jira/browse/MESOS-5727
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.2, 1.0.0
>Reporter: Jie Yu
>Assignee: Gilbert Song
> Fix For: 1.0.0
>
>
> Since we launch the task after pivot_root, we no longer have access to the 
> mesos-health-check binary. The solution is to refactor the health check into 
> a library (libprocess) so that it does not depend on the underlying filesystem.
> One note here is that we should strive to keep both the command executor and 
> the task in the same mount namespace, so that Mesos CLI tooling does not need 
> to find the mount namespace for the task. It just needs to find the 
> corresponding pid for the executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5489) Implement GET_STATE Call in v1 master API.

2016-06-28 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5489:
--
Shepherd: Vinod Kone

> Implement GET_STATE Call in v1 master API.
> --
>
> Key: MESOS-5489
> URL: https://issues.apache.org/jira/browse/MESOS-5489
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Zhitao Li
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-06-28 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-4609:
-
Fix Version/s: (was: 1.0.0)

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> | || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> 
> exit | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | 
> Nothing happens (?) |
> (?) = usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment.  
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.
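
For the first bullet, a hedged sketch; the helper name is an assumption, not the eventual libprocess API.

{code}
#include <map>
#include <string>

#include <stout/os.hpp>

// Illustrative only: the environment a subprocess should inherit when
// the caller passes environment = None(). Dropping LIBPROCESS_PORT
// prevents an accidental bind collision with the parent.
std::map<std::string, std::string> inheritedEnvironment()
{
  std::map<std::string, std::string> env = os::environment();
  env.erase("LIBPROCESS_PORT");  // Child must pick (or be given) its own port.
  return env;
}
{code}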



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5712) Document exactly what is handled by GET_ENDPOINTS_WITH_PATH acl

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5712:
--
Shepherd:   (was: Adam B)

> Document exactly what is handled by GET_ENDPOINTS_WITH_PATH acl
> ---
>
> Key: MESOS-5712
> URL: https://issues.apache.org/jira/browse/MESOS-5712
> Project: Mesos
>  Issue Type: Task
>  Components: documentation, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Minor
>  Labels: documentation, mesosphere, security
> Fix For: 1.0.0
>
>
> Users may expect that the GET_ENDPOINT_WITH_PATH acl can be used with any 
> Mesos endpoint, but that is not (yet) the case. We should clearly document 
> the list of applicable endpoints, in authorization.md and probably even 
> upgrades.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5736) Document how to work with offers

2016-06-28 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5736:
--

 Summary: Document how to work with offers
 Key: MESOS-5736
 URL: https://issues.apache.org/jira/browse/MESOS-5736
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Neil Conway


* Offers: what resources can appear together in the same resource offer, which 
offers can be delivered together, and which offers can be accepted with a 
single ACCEPT call
* Filters, declining offers, using "revive"
* Revocable resources / oversubscription



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5735) Update WebUI to use v1 operator API

2016-06-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-5735:
-

 Summary: Update WebUI to use v1 operator API
 Key: MESOS-5735
 URL: https://issues.apache.org/jira/browse/MESOS-5735
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


Having the WebUI use the v1 API would be a good validation of its usefulness 
and correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5295) The task launched by non-checkpointed HTTP command executor will keep running till executor shutdown grace period (5s) after agent process exits.

2016-06-28 Thread José Guilherme Vanz (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353597#comment-15353597
 ] 

José Guilherme Vanz commented on MESOS-5295:


Hi [~qianzhang], are you working on this? If not, might I try to fix it?

> The task launched by non-checkpointed HTTP command executor will keep running 
> till executor shutdown grace period (5s) after agent process exits.
> -
>
> Key: MESOS-5295
> URL: https://issues.apache.org/jira/browse/MESOS-5295
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> When I tested the HTTP command executor, I found an issue. Here are my steps:
> 1. A framework which has no checkpointing enabled launches a long-running task 
> (e.g., sleep 1000).
> 2. After the task is running, kill the agent.
>  
> Then I see that the HTTP command executor terminates after 5s 
> ("DEFAULT_EXECUTOR_SHUTDOWN_GRACE_PERIOD"), but the task keeps running. 
> This behavior is not consistent with the driver-based command executor: after 
> the agent is killed, that executor will kill the task and then terminate 
> itself after 1s (there is an "os::sleep(Seconds(1));" in "reaped()").
> The root cause of this difference is that, for the driver-based command 
> executor, when the driver finds the agent is down, it calls executor->shutdown() 
> (https://github.com/apache/mesos/blob/0.28.1/src/exec/exec.cpp#L487), so the 
> executor kills the task and then terminates itself. But for the HTTP command 
> executor, its "disconnected()" is called 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L388) 
> when the agent is down, and currently we do not do anything in 
> "disconnected()", so the task keeps running and the executor is 
> killed after 5s 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L623).
> The behavior of the driver-based command executor is correct; we need to make 
> sure the HTTP command executor kills the task when the agent is down if 
> checkpointing is not enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-5295) The task launched by non-checkpointed HTTP command executor will keep running till executor shutdown grace period (5s) after agent process exits.

2016-06-28 Thread José Guilherme Vanz (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

José Guilherme Vanz updated MESOS-5295:
---
Comment: was deleted

(was: Hi [~qianzhang], are you working on this? If not, might I try to fix it?
)

> The task launched by non-checkpointed HTTP command executor will keep running 
> till executor shutdown grace period (5s) after agent process exits.
> -
>
> Key: MESOS-5295
> URL: https://issues.apache.org/jira/browse/MESOS-5295
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> When I tested the HTTP command executor, I found an issue. Here are my steps:
> 1. A framework which has no checkpointing enabled launches a long-running task 
> (e.g., sleep 1000).
> 2. After the task is running, kill the agent.
>  
> Then I see that the HTTP command executor terminates after 5s 
> ("DEFAULT_EXECUTOR_SHUTDOWN_GRACE_PERIOD"), but the task keeps running. 
> This behavior is not consistent with the driver-based command executor: after 
> the agent is killed, that executor will kill the task and then terminate 
> itself after 1s (there is an "os::sleep(Seconds(1));" in "reaped()").
> The root cause of this difference is that, for the driver-based command 
> executor, when the driver finds the agent is down, it calls executor->shutdown() 
> (https://github.com/apache/mesos/blob/0.28.1/src/exec/exec.cpp#L487), so the 
> executor kills the task and then terminates itself. But for the HTTP command 
> executor, its "disconnected()" is called 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L388) 
> when the agent is down, and currently we do not do anything in 
> "disconnected()", so the task keeps running and the executor is 
> killed after 5s 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L623).
> The behavior of the driver-based command executor is correct; we need to make 
> sure the HTTP command executor kills the task when the agent is down if 
> checkpointing is not enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5295) The task launched by non-checkpointed HTTP command executor will keep running till executor shutdown grace period (5s) after agent process exits.

2016-06-28 Thread José Guilherme Vanz (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353596#comment-15353596
 ] 

José Guilherme Vanz commented on MESOS-5295:


Hi [~qianzhang], are you working on this? If not, might I try to fix it?


> The task launched by non-checkpointed HTTP command executor will keep running 
> till executor shutdown grace period (5s) after agent process exits.
> -
>
> Key: MESOS-5295
> URL: https://issues.apache.org/jira/browse/MESOS-5295
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> When I tested the HTTP command executor, I found an issue. Here are my steps:
> 1. A framework which has no checkpointing enabled launches a long-running task 
> (e.g., sleep 1000).
> 2. After the task is running, kill the agent.
>  
> Then I see that the HTTP command executor terminates after 5s 
> ("DEFAULT_EXECUTOR_SHUTDOWN_GRACE_PERIOD"), but the task keeps running. 
> This behavior is not consistent with the driver-based command executor: after 
> the agent is killed, that executor will kill the task and then terminate 
> itself after 1s (there is an "os::sleep(Seconds(1));" in "reaped()").
> The root cause of this difference is that, for the driver-based command 
> executor, when the driver finds the agent is down, it calls executor->shutdown() 
> (https://github.com/apache/mesos/blob/0.28.1/src/exec/exec.cpp#L487), so the 
> executor kills the task and then terminates itself. But for the HTTP command 
> executor, its "disconnected()" is called 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L388) 
> when the agent is down, and currently we do not do anything in 
> "disconnected()", so the task keeps running and the executor is 
> killed after 5s 
> (https://github.com/apache/mesos/blob/0.28.1/src/executor/executor.cpp#L623).
> The behavior of the driver-based command executor is correct; we need to make 
> sure the HTTP command executor kills the task when the agent is down if 
> checkpointing is not enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5499) Implement RESERVE_RESOURCES Call in v1 master API.

2016-06-28 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353463#comment-15353463
 ] 

Anand Mazumdar commented on MESOS-5499:
---

{noformat}
commit d43f134874be130128b57929bc0917f25344d169
Author: Abhishek Dasgupta 
Date:   Tue Jun 28 11:00:59 2016 -0700

Added validation check on resources for reserve/unreserve call.

Review: https://reviews.apache.org/r/49296/
{noformat}

> Implement RESERVE_RESOURCES Call in v1 master API.
> --
>
> Key: MESOS-5499
> URL: https://issues.apache.org/jira/browse/MESOS-5499
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Abhishek Dasgupta
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5510) Implement REMOVE_QUOTA Call in v1 master API.

2016-06-28 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5510:
--
Shepherd: Anand Mazumdar

> Implement REMOVE_QUOTA Call in v1 master API.
> -
>
> Key: MESOS-5510
> URL: https://issues.apache.org/jira/browse/MESOS-5510
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Abhishek Dasgupta
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5597) Document Mesos "health check" feature

2016-06-28 Thread Ken Sipe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Sipe updated MESOS-5597:

Assignee: (was: Ken Sipe)

> Document Mesos "health check" feature
> -
>
> Key: MESOS-5597
> URL: https://issues.apache.org/jira/browse/MESOS-5597
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Reporter: Neil Conway
>  Labels: documentation, health-check, mesosphere
>
> We don't talk about this feature at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5597) Document Mesos "health check" feature

2016-06-28 Thread Ken Sipe (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ken Sipe reassigned MESOS-5597:
---

Assignee: Ken Sipe

> Document Mesos "health check" feature
> -
>
> Key: MESOS-5597
> URL: https://issues.apache.org/jira/browse/MESOS-5597
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Ken Sipe
>  Labels: documentation, health-check, mesosphere
>
> We don't talk about this feature at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5711) Update AUTHORIZATION strings in endpoint help

2016-06-28 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-5711:
---
Shepherd: Alexander Rukletsov

> Update AUTHORIZATION strings in endpoint help
> -
>
> Key: MESOS-5711
> URL: https://issues.apache.org/jira/browse/MESOS-5711
> Project: Mesos
>  Issue Type: Task
>  Components: documentation, security
>Reporter: Adam B
>Assignee: Joerg Schad
>Priority: Minor
>  Labels: documentation, mesosphere, security
> Fix For: 1.0.0
>
>
> The endpoint help macros support AUTHENTICATION and AUTHORIZATION sections. 
> We added AUTHORIZATION help for some of the newer endpoints, but not the 
> previously authenticated endpoints.
> Authorization endpoints needing help string updates:
> Master::Http::CREATE_VOLUMES_HELP
> Master::Http::DESTROY_VOLUMES_HELP
> Master::Http::RESERVE_HELP
> Master::Http::STATE_HELP
> Master::Http::STATESUMMARY_HELP
> Master::Http::TEARDOWN_HELP
> Master::Http::TASKS_HELP
> Master::Http::UNRESERVE_HELP
> Slave::Http::STATE_HELP



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5734) systemd.hpp header should be installed as part of Mesos header files

2016-06-28 Thread Will Rouesnel (JIRA)
Will Rouesnel created MESOS-5734:


 Summary: systemd.hpp header should be installed as part of Mesos 
header files
 Key: MESOS-5734
 URL: https://issues.apache.org/jira/browse/MESOS-5734
 Project: Mesos
  Issue Type: Improvement
  Components: c++ api
Affects Versions: 0.28.2, 1.0.0
Reporter: Will Rouesnel
Priority: Minor


When writing a container logger plugin à la the logrotate logger, there is a 
need to incorporate the systemd functionality used internally by Mesos to 
extend process lifespans beyond the executor.

This looks to be something almost every logging plugin would want to do when 
spawning subprocesses, but the header files needed for it are currently buried 
in `src/linux`, requiring a full Mesos source installation.

This functionality needs to be lifted and made available as part of the public 
API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5730) Sandbox access authorization should fail for non existing sandboxes.

2016-06-28 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353089#comment-15353089
 ] 

Joerg Schad commented on MESOS-5730:


Proposed doc fix: https://reviews.apache.org/r/49319/

> Sandbox access authorization should fail for non existing sandboxes.
> 
>
> Key: MESOS-5730
> URL: https://issues.apache.org/jira/browse/MESOS-5730
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
>
> The local authorizer currently tries to authorize {{ACCESS_SANDBOX}} even if 
> no further object specification (e.g., {{framework_info}} or 
> {{executor_info}}) was specified / available at that time.
> Given that there is likely no sandbox available if there was no 
> {{executor_info}} provided, I think we should actually fail instead of 
> allowing or denying (403).
> A failure would result in an IMHO more appropriate ServiceUnavailable 
> (503).
> See 
> https://github.com/apache/mesos/commit/c8d67590064e35566274116cede9c6a733187b48#diff-dd692b1640b2628014feca01a94ba1e1R241



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5730) Sandbox access authorization should fail for non existing sandboxes.

2016-06-28 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353086#comment-15353086
 ] 

Joerg Schad commented on MESOS-5730:


A short related issue: the comment in authorizer.proto for `ACCESS_SANDBOX` is 
incorrect in that it states "// This action will have a `FrameworkInfo` and 
`ExecutorInfo` set". We should either fix the semantics (i.e., make them really 
required) or fix the comment. A sketch of the proposed fail-fast semantics 
follows.
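
A minimal sketch of that proposal; {{Outcome}}, {{authorizeSandboxAccess}} and 
the plain structs are illustrative stand-ins, not the actual authorizer 
interface:

{code}
#include <iostream>
#include <optional>
#include <string>

struct ExecutorInfo { std::string id; };

enum class Outcome { ALLOWED, DENIED, FAILED };

Outcome authorizeSandboxAccess(
    const std::string& principal,
    const std::optional<ExecutorInfo>& executorInfo)
{
  if (!executorInfo.has_value()) {
    // No object to authorize against: the request is malformed, so fail
    // (surfacing as 503 ServiceUnavailable) instead of denying (403).
    return Outcome::FAILED;
  }

  // ... evaluate the ACLs for `principal` against `executorInfo` here;
  // allow unconditionally for the sketch.
  return Outcome::ALLOWED;
}

int main()
{
  // A request missing the executor info fails rather than being denied.
  std::cout
    << (authorizeSandboxAccess("ops", std::nullopt) == Outcome::FAILED)
    << std::endl; // Prints 1.
  return 0;
}
{code}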


> Sandbox access authorization should fail for non existing sandboxes.
> 
>
> Key: MESOS-5730
> URL: https://issues.apache.org/jira/browse/MESOS-5730
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
>
> The local authorizer currently tries to authorize {{ACCESS_SANDBOX}} even if 
> no further object specification (e.g., {{framework_info}} or 
> {{executor_info}}) was specified / available at that time.
> Given that there is likely no sandbox available if there was no 
> {{executor_info}} provided, I think we should actually fail instead of 
> allowing or denying (403).
> A failure would result in an IMHO more appropriate ServiceUnavailable 
> (503).
> See 
> https://github.com/apache/mesos/commit/c8d67590064e35566274116cede9c6a733187b48#diff-dd692b1640b2628014feca01a94ba1e1R241



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5733) Tests for quota + over-subscription

2016-06-28 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5733:
--

 Summary: Tests for quota + over-subscription
 Key: MESOS-5733
 URL: https://issues.apache.org/jira/browse/MESOS-5733
 Project: Mesos
  Issue Type: Task
  Components: tests
Reporter: Neil Conway


The quota role sorter ignores revocable resources; we should have tests to 
validate that changing the revocable resources at an agent (via 
oversubscription) doesn't change quota allocation behavior. A minimal model of 
the property to test is sketched below.
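
A self-contained model of that property; the types below are stand-ins, not the 
actual allocator or sorter:

{code}
#include <cassert>
#include <iostream>

struct Resources
{
  double nonRevocableCpus;
  double revocableCpus;
};

// Models the quota role sorter's view: only non-revocable resources count
// towards quota.
double quotaAllocatable(const Resources& agent)
{
  return agent.nonRevocableCpus;
}

int main()
{
  Resources agent{8.0, 0.0}; // 8 non-revocable cpus, no revocable ones.

  double before = quotaAllocatable(agent);

  // The oversubscription module reports 4 revocable cpus on the same agent.
  agent.revocableCpus += 4.0;

  double after = quotaAllocatable(agent);

  assert(before == after); // Quota allocation behavior must not change.
  std::cout << "quota-allocatable cpus: " << after << std::endl;
  return 0;
}
{code}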



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5709) Authorization for /roles

2016-06-28 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352968#comment-15352968
 ] 

Joerg Schad commented on MESOS-5709:


Proposal: add VIEW_ROLE and adapt GET_WEIGHTS to use that as well.

> Authorization for /roles
> 
>
> Key: MESOS-5709
> URL: https://issues.apache.org/jira/browse/MESOS-5709
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Adam B
>Assignee: Joerg Schad
>Priority: Minor
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> The /roles endpoint exposes the list of all roles and their weights, as well 
> as the list of all frameworkIds registered with each role. This is a superset 
> of the information exposed on GET /weights, which we already protect. We 
> should protect the data in /roles the same way.
> - Should we reuse VIEW_FRAMEWORK with role (from /state)?
> - Should we add a new VIEW_ROLE and adapt GET_WEIGHTS to use it?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2043) Framework auth fail with timeout error and never get authenticated

2016-06-28 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332599#comment-15332599
 ] 

Benjamin Bannier edited comment on MESOS-2043 at 6/28/16 1:16 PM:
--

Review: https://reviews.apache.org/r/49308/


was (Author: bbannier):
-Review-: https://reviews.apache.org/r/48744/

EDIT: Review discarded for now as we need a better solution.

> Framework auth fail with timeout error and never get authenticated
> --
>
> Key: MESOS-2043
> URL: https://issues.apache.org/jira/browse/MESOS-2043
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver, security, slave
>Affects Versions: 0.21.0
>Reporter: Bhuvan Arumugam
>Assignee: Benjamin Bannier
>Priority: Critical
>  Labels: mesosphere, security
> Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, 
> mesos-master.20141104-1606-1706.log, slave.log
>
>
> I'm facing this issue in master as of 
> https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4
> As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm 
> running 1 master and 1 scheduler (Aurora). The framework authentication fails 
> due to a timeout:
> error on mesos master:
> {code}
> I1104 19:37:17.741449  8329 master.cpp:3874] Authenticating 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083
> I1104 19:37:17.741585  8329 master.cpp:3885] Using default CRAM-MD5 
> authenticator
> I1104 19:37:17.742106  8336 authenticator.hpp:169] Creating new server SASL 
> connection
> W1104 19:37:22.742959  8329 master.cpp:3953] Authentication timed out
> W1104 19:37:22.743548  8329 master.cpp:3930] Failed to authenticate 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: 
> Authentication discarded
> {code}
> scheduler error:
> {code}
> I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master 
> master@MASTER_IP:PORT
> I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL 
> connection
> I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL 
> authentication mechanisms: CRAM-MD5
> I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate 
> with mechanism 'CRAM-MD5'
> W1104 19:39:02.891196 49005 sched.cpp:378] Authentication timed out
> I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master 
> master@MASTER_IP:PORT: Authentication discarded
> {code}
> Looks like two instances {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} & 
> {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}} of the same framework are 
> trying to authenticate and failing.
> {code}
> W1104 19:36:30.769420  8319 master.cpp:3930] Failed to authenticate 
> scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to 
> communicate with authenticatee
> I1104 19:36:42.701441  8328 master.cpp:3860] Queuing up authentication 
> request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 
> because authentication is still in progress
> {code}
> Restarting master and scheduler didn't fix it. 
> This particular issue happens with 1 master and 1 scheduler after MESOS-1866 
> was fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5712) Document exactly what is handled by GET_ENDPOINTS_WITH_PATH acl

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5712:
--
Assignee: Alexander Rojas

> Document exactly what is handled by GET_ENDPOINTS_WITH_PATH acl
> ---
>
> Key: MESOS-5712
> URL: https://issues.apache.org/jira/browse/MESOS-5712
> Project: Mesos
>  Issue Type: Task
>  Components: documentation, security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Minor
>  Labels: documentation, mesosphere, security
> Fix For: 1.0.0
>
>
> Users may expect that the GET_ENDPOINT_WITH_PATH acl can be used with any 
> Mesos endpoint, but that is not (yet) the case. We should clearly document 
> the list of applicable endpoints in authorization.md and probably even 
> upgrades.md (an example of the ACL in question follows).
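
For illustration, the kind of example such documentation might include: an 
{{--acls}} snippet restricting one endpoint path to a principal. The JSON field 
names follow acls.proto as of 1.0; the concrete principal and path are 
assumptions:

{noformat}
{
  "get_endpoints": [
    {
      "principals": { "values": ["ops"] },
      "paths": { "values": ["/flags"] }
    }
  ]
}
{noformat}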



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4609) Subprocess should be more intelligent about setting/inheriting libprocess environment variables

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-4609:
--
Component/s: libprocess

> Subprocess should be more intelligent about setting/inheriting libprocess 
> environment variables 
> 
>
> Key: MESOS-4609
> URL: https://issues.apache.org/jira/browse/MESOS-4609
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.27.0
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> Mostly copied from [this 
> comment|https://issues.apache.org/jira/browse/MESOS-4598?focusedCommentId=15133497=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15133497]
> A subprocess inheriting the environment variables {{LIBPROCESS_*}} may run 
> into some accidental fatalities:
> || || Subprocess uses libprocess || Subprocess is something else ||
> || Subprocess sets/inherits the same {{PORT}} by accident | Bind failure -> exit | Nothing happens (?) |
> || Subprocess sets a different {{PORT}} on purpose | Bind success (?) | Nothing happens (?) |
> (?) means this is usually the case, but not 100%.
> A complete fix would look something like:
> * If the {{subprocess}} call gets {{environment = None()}}, we should 
> automatically remove {{LIBPROCESS_PORT}} from the inherited environment (see 
> the sketch after this list).
> * The parts of 
> [{{executorEnvironment}}|https://github.com/apache/mesos/blame/master/src/slave/containerizer/containerizer.cpp#L265]
>  dealing with libprocess & libmesos should be refactored into libprocess as a 
> helper.  We would use this helper for the Containerizer, Fetcher, and 
> ContainerLogger module.
> * If the {{subprocess}} call is given {{LIBPROCESS_PORT == 
> os::getenv("LIBPROCESS_PORT")}}, we can LOG(WARN) and unset the env var 
> locally.
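
A minimal sketch of the first bullet, with stand-in types rather than the 
actual libprocess {{subprocess}} API:

{code}
#include <iostream>
#include <map>
#include <optional>
#include <string>

using Environment = std::map<std::string, std::string>;

// Minimal stand-in; a real implementation would walk `environ`.
Environment inheritedEnvironment()
{
  return {{"LIBPROCESS_PORT", "5051"}, {"PATH", "/usr/bin"}};
}

// If the caller passes no explicit environment, inherit the parent's but
// drop LIBPROCESS_PORT so the child cannot accidentally bind to the
// parent's port (the "bind failure -> exit" cell in the table above).
Environment effectiveEnvironment(const std::optional<Environment>& requested)
{
  if (requested.has_value()) {
    return requested.value(); // Caller takes full responsibility.
  }

  Environment env = inheritedEnvironment();
  env.erase("LIBPROCESS_PORT");
  return env;
}

int main()
{
  Environment env = effectiveEnvironment(std::nullopt);
  std::cout
    << "LIBPROCESS_PORT inherited: " << env.count("LIBPROCESS_PORT")
    << std::endl; // Prints 0.
  return 0;
}
{code}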



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5730) Sandbox access authorization should fail for non existing sandboxes.

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5730:
--
Component/s: security

> Sandbox access authorization should fail for non existing sandboxes.
> 
>
> Key: MESOS-5730
> URL: https://issues.apache.org/jira/browse/MESOS-5730
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
>
> The local authorizer currently tries to authorize {{ACCESS_SANDBOX}} even if 
> no further object specification (e.g., {{framework_info}} or 
> {{executor_info}}) was specified / available at that time.
> Given that there is likely no sandbox available if there was no 
> {{executor_info}} provided, I think we should actually fail instead of 
> allowing or denying (403).
> A failure would result in an IMHO more appropriate ServiceUnavailable 
> (503).
> See 
> https://github.com/apache/mesos/commit/c8d67590064e35566274116cede9c6a733187b48#diff-dd692b1640b2628014feca01a94ba1e1R241



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5732) MasterAPITest.UnreserveResources is slow

2016-06-28 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352934#comment-15352934
 ] 

Neil Conway commented on MESOS-5732:


cc [~a10gupta]

> MasterAPITest.UnreserveResources is slow
> 
>
> Key: MESOS-5732
> URL: https://issues.apache.org/jira/browse/MESOS-5732
> Project: Mesos
>  Issue Type: Improvement
>  Components: tests
>Reporter: Neil Conway
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] ContentType/MasterAPITest.UnreserveResources/0
> [   OK ] ContentType/MasterAPITest.UnreserveResources/0 (6033 ms)
> [ RUN  ] ContentType/MasterAPITest.UnreserveResources/1
> [   OK ] ContentType/MasterAPITest.UnreserveResources/1 (6041 ms)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5732) MasterAPITest.UnreserveResources is slow

2016-06-28 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5732:
--

 Summary: MasterAPITest.UnreserveResources is slow
 Key: MESOS-5732
 URL: https://issues.apache.org/jira/browse/MESOS-5732
 Project: Mesos
  Issue Type: Improvement
  Components: tests
Reporter: Neil Conway


{noformat}
[ RUN  ] ContentType/MasterAPITest.UnreserveResources/0
[   OK ] ContentType/MasterAPITest.UnreserveResources/0 (6033 ms)
[ RUN  ] ContentType/MasterAPITest.UnreserveResources/1
[   OK ] ContentType/MasterAPITest.UnreserveResources/1 (6041 ms)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5724) SSL certificate validation should allow IP only verification.

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5724:
--
Sprint: Mesosphere Sprint 38

> SSL certificate validation should allow IP only verification.
> -
>
> Key: MESOS-5724
> URL: https://issues.apache.org/jira/browse/MESOS-5724
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 
> 0.26.0, 0.26.1, 0.27.0, 0.27.1, 0.27.2, 0.27.3, 0.28.0, 0.28.1, 0.28.2, 1.0.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, mesosphere, security, ssl
> Fix For: 1.0.0
>
>
> Our SSL certificate validation currently assumes that the host (on connect 
> and on accept) has a valid hostname. This, however, is not true for all 
> environments.
> {{process::network::openssl::verify}} currently only allows the validation of 
> a certificate against a hostname. 
> See 
> https://github.com/apache/mesos/blob/08866edd8a71d12f87f4f4dbefa292729efbf6ae/3rdparty/libprocess/src/openssl.cpp#L546
> RFC 2818, however, says that it should be perfectly valid to validate a 
> certificate based on the IP address.
> See https://tools.ietf.org/html/rfc2818
> {noformat}
> In some cases, the URI is specified as an IP address rather than a
> hostname. In this case, the iPAddress subjectAltName must be present
> in the certificate and must exactly match the IP in the URI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5730) Sandbox access authorization should fail for non existing sandboxes.

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5730:
--
Sprint: Mesosphere Sprint 38

> Sandbox access authorization should fail for non existing sandboxes.
> 
>
> Key: MESOS-5730
> URL: https://issues.apache.org/jira/browse/MESOS-5730
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
>
> The local authorizer currently tries to authorize {{ACCESS_SANDBOX}} even if 
> no further object specification (e.g., {{framework_info}} or 
> {{executor_info}}) was specified / available at that time.
> Given that there is likely no sandbox available if there was no 
> {{executor_info}} provided, I think we should actually fail instead of 
> allowing or denying (403).
> A failure would result in an IMHO more appropriate ServiceUnavailable 
> (503).
> See 
> https://github.com/apache/mesos/commit/c8d67590064e35566274116cede9c6a733187b48#diff-dd692b1640b2628014feca01a94ba1e1R241



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5718) Mesos UI shows "Task is in RUNNING status" but can't find it in the Mesos Agent.

2016-06-28 Thread chenqiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenqiang updated MESOS-5718:
-
Description: 
We have found an issue where a task launched by Marathon with a Docker container 
shows "Task is in RUNNING status" in the Mesos UI, but cannot be found on the 
Mesos agent host. Namely, the Docker container doesn't exist, but the task is 
shown as RUNNING in the Mesos UI.


Part of the log is attached below:

```
I0627 14:31:30.239467  3913 slave.cpp:1912] Asked to kill task 
tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 of 
framework 20141201-145651-1900714250-5050-3484-
W0627 14:31:30.239547  3913 slave.cpp:2025] Ignoring kill task 
tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 
because the executor 
'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' of 
framework 20141201-145651-1900714250-5050-3484- at 
executor(1)@10.153.96.22:14578 is terminating/terminated


I0624 14:46:04.398646  3921 slave.cpp:4511] Sending reconnect request to 
executor 
'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' of 
framework 20141201-145651-1900714250-5050-3484- at 
executor(1)@10.153.96.22:14578

I0624 14:46:06.399073  3899 slave.cpp:2991] Killing un-reregistered executor 
'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' of 
framework 20141201-145651-1900714250-5050-3484- at 
executor(1)@10.153.96.22:14578
I0624 14:46:06.399183  3899 slave.cpp:4571] Finished recovery
I0624 14:46:06.399375  3902 docker.cpp:1724] Destroying container 
'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'
I0624 14:46:06.399431  3902 docker.cpp:1852] Running docker stop on container 
'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'

I0624 14:46:50.985178  3908 slave.cpp:1912] Asked to kill task 
tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 of 
framework 20141201-145651-1900714250-5050-3484-
W0624 14:46:50.985242  3908 slave.cpp:2025] Ignoring kill task 
tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 
because the executor 
'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' of 
framework 20141201-145651-1900714250-5050-3484- at 
executor(1)@10.153.96.22:14578 is terminating/terminated
I0624 14:46:58.593792  3901 slave.cpp:4380] Current disk usage 5.54%. Max 
allowed age: 5.912523133557199days


``` 

What's the root cause? It seems the executor of that task has terminated, but 
the kill of the task is ignored by the slave.


FIX: After restarting mesos-slave, the RUNNING task goes to FAILED status, and 
we can see it launched again on another agent; the task returns to normal.


  was:
Now, we find an issue that a task launched by marathon with docker container 
shows "Task is in RUNNING status" in Mesos UI, but can't find it in the mesos 
Agent host. Namely, the docker container doesn't exist but the Task is shown As 
RUNNING in Mesos UI.  so interesting...


Parts log is attached as belows:

```
I0627 14:31:30.239467  3913 slave.cpp:1912] Asked to kill task 
tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 of 
framework 20141201-145651-1900714250-5050-3484-
W0627 14:31:30.239547  3913 slave.cpp:2025] Ignoring kill task 
tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 
because the executor 
'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' of 
framework 20141201-145651-1900714250-5050-3484- at 
executor(1)@10.153.96.22:14578 is terminating/terminated


I0624 14:46:04.398646  3921 slave.cpp:4511] Sending reconnect request to 
executor 
'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' of 
framework 20141201-145651-1900714250-5050-3484- at 
executor(1)@10.153.96.22:14578

I0624 14:46:06.399073  3899 slave.cpp:2991] Killing un-reregistered executor 
'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' of 
framework 20141201-145651-1900714250-5050-3484- at 
executor(1)@10.153.96.22:14578
I0624 14:46:06.399183  3899 slave.cpp:4571] Finished recovery
I0624 14:46:06.399375  3902 docker.cpp:1724] Destroying container 
'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'
I0624 14:46:06.399431  3902 docker.cpp:1852] Running docker stop on container 
'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'

``` 

What's the root cause ? It seems executor of that task is terminated, but the 
task is ignored kill by slave.


FIX: After restart mesos-slave, the RUNNING task becomes  in FAILED status, and 
we can see it is launched again in other Agent, the task restores to normal...



> Mesos UI shows "Taks is in RUNNING status" but can't find it in the mesos 
> Agent.
> 
>
> Key: MESOS-5718
> URL: 

[jira] [Updated] (MESOS-5707) LocalAuthorizer should error if passed a GET_ENDPOINT ACL with an unhandled path

2016-06-28 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5707:
--
Shepherd:   (was: Adam B)

> LocalAuthorizer should error if passed a GET_ENDPOINT ACL with an unhandled 
> path
> 
>
> Key: MESOS-5707
> URL: https://issues.apache.org/jira/browse/MESOS-5707
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Adam B
>Assignee: Alexander Rojas
>Priority: Critical
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> Since GET_ENDPOINT_WITH_PATH doesn't (yet) work with any arbitrary path, we 
> should
> a) validate --acls and error if GET_ENDPOINT_WITH_PATH has a path object that 
> doesn't match an endpoint that uses this authz strategy.
> b) document exactly which endpoints support GET_ENDPOINT_WITH_PATH



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5724) SSL certificate validation should allow IP only verification.

2016-06-28 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff reassigned MESOS-5724:
-

Assignee: Till Toenshoff

> SSL certificate validation should allow IP only verification.
> -
>
> Key: MESOS-5724
> URL: https://issues.apache.org/jira/browse/MESOS-5724
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 
> 0.26.0, 0.26.1, 0.27.0, 0.27.1, 0.27.2, 0.27.3, 0.28.0, 0.28.1, 0.28.2, 1.0.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, mesosphere, security, ssl
> Fix For: 1.0.0
>
>
> Our SSL certificate validation currently assumes that the host (on connect 
> and on accept) has a valid hostname. This, however, is not true for all 
> environments.
> {{process::network::openssl::verify}} currently only allows the validation of 
> a certificate against a hostname. 
> See 
> https://github.com/apache/mesos/blob/08866edd8a71d12f87f4f4dbefa292729efbf6ae/3rdparty/libprocess/src/openssl.cpp#L546
> RFC 2818, however, says that it should be perfectly valid to validate a 
> certificate based on the IP address.
> See https://tools.ietf.org/html/rfc2818
> {noformat}
> In some cases, the URI is specified as an IP address rather than a
> hostname. In this case, the iPAddress subjectAltName must be present
> in the certificate and must exactly match the IP in the URI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5724) SSL certificate validation should allow IP only verification.

2016-06-28 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5724:
--
Affects Version/s: 0.23.0
   0.23.1
   0.24.0
   0.24.1
   0.24.2
   0.25.0
   0.25.1
   0.26.0
   0.26.1
   0.27.0
   0.27.1
   0.27.2
   0.27.3
   0.28.0
   0.28.1
   0.28.2

> SSL certificate validation should allow IP only verification.
> -
>
> Key: MESOS-5724
> URL: https://issues.apache.org/jira/browse/MESOS-5724
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 
> 0.26.0, 0.26.1, 0.27.0, 0.27.1, 0.27.2, 0.27.3, 0.28.0, 0.28.1, 0.28.2, 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, mesosphere, security, ssl
> Fix For: 1.0.0
>
>
> Our SSL certificate validation currently assumes that the host (on connect 
> and on accept) has a valid hostname. This, however, is not true for all 
> environments.
> {{process::network::openssl::verify}} currently only allows the validation of 
> a certificate against a hostname. 
> See 
> https://github.com/apache/mesos/blob/08866edd8a71d12f87f4f4dbefa292729efbf6ae/3rdparty/libprocess/src/openssl.cpp#L546
> RFC 2818, however, says that it should be perfectly valid to validate a 
> certificate based on the IP address.
> See https://tools.ietf.org/html/rfc2818
> {noformat}
> In some cases, the URI is specified as an IP address rather than a
> hostname. In this case, the iPAddress subjectAltName must be present
> in the certificate and must exactly match the IP in the URI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5724) SSL certificate validation should allow IP only verification.

2016-06-28 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352579#comment-15352579
 ] 

Till Toenshoff commented on MESOS-5724:
---

It has now been agreed to do the following:
- add a flag that prevents reverse-lookups for SSL certificate verification
- add another argument to {{verify}} which in turn allows verification of an 
iPAddress-based alternative name (see the sketch after this list)
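
A rough sketch of the second item, assuming OpenSSL >= 1.0.2; {{verifyIP}} and 
{{verifyPeer}} are hypothetical helpers, not the actual libprocess change:

{code}
#include <openssl/ssl.h>
#include <openssl/x509v3.h>

#include <string>

// Returns true iff `cert` carries an iPAddress subjectAltName entry exactly
// matching `ip` (e.g. "10.0.0.1"), which is the RFC 2818 behavior quoted in
// the ticket.
bool verifyIP(X509* cert, const std::string& ip)
{
  return X509_check_ip_asc(cert, ip.c_str(), 0) == 1;
}

bool verifyPeer(SSL* ssl, const std::string& ip)
{
  X509* cert = SSL_get_peer_certificate(ssl);
  if (cert == nullptr) {
    return false; // No certificate presented by the peer.
  }

  bool ok = verifyIP(cert, ip);
  X509_free(cert);
  return ok;
}
{code}

Since {{X509_check_ip_asc}} never consults DNS, it also fits the first item: no 
reverse lookup is involved when matching against an IP.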

> SSL certificate validation should allow IP only verification.
> -
>
> Key: MESOS-5724
> URL: https://issues.apache.org/jira/browse/MESOS-5724
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, mesosphere, security, ssl
> Fix For: 1.0.0
>
>
> Our SSL certificate validation currently assumes that the host (on connect 
> and on accept) has a valid hostname. This, however, is not true for all 
> environments.
> {{process::network::openssl::verify}} currently only allows the validation of 
> a certificate against a hostname. 
> See 
> https://github.com/apache/mesos/blob/08866edd8a71d12f87f4f4dbefa292729efbf6ae/3rdparty/libprocess/src/openssl.cpp#L546
> RFC 2818, however, says that it should be perfectly valid to validate a 
> certificate based on the IP address.
> See https://tools.ietf.org/html/rfc2818
> {noformat}
> In some cases, the URI is specified as an IP address rather than a
> hostname. In this case, the iPAddress subjectAltName must be present
> in the certificate and must exactly match the IP in the URI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5724) SSL certificate validation should allow IP only verification.

2016-06-28 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5724:
--
Shepherd: Joris Van Remoortere

> SSL certificate validation should allow IP only verification.
> -
>
> Key: MESOS-5724
> URL: https://issues.apache.org/jira/browse/MESOS-5724
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, mesosphere, security, ssl
> Fix For: 1.0.0
>
>
> Our SSL certificate validation currently assumes that the host (on connect 
> and on accept) has a valid hostname. This, however, is not true for all 
> environments.
> {{process::network::openssl::verify}} currently only allows the validation of 
> a certificate against a hostname. 
> See 
> https://github.com/apache/mesos/blob/08866edd8a71d12f87f4f4dbefa292729efbf6ae/3rdparty/libprocess/src/openssl.cpp#L546
> RFC 2818, however, says that it should be perfectly valid to validate a 
> certificate based on the IP address.
> See https://tools.ietf.org/html/rfc2818
> {noformat}
> In some cases, the URI is specified as an IP address rather than a
> hostname. In this case, the iPAddress subjectAltName must be present
> in the certificate and must exactly match the IP in the URI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5724) SSL certificate validation should allow IP only verification.

2016-06-28 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5724:
--
Fix Version/s: 1.0.0

> SSL certificate validation should allow IP only verification.
> -
>
> Key: MESOS-5724
> URL: https://issues.apache.org/jira/browse/MESOS-5724
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: libprocess, mesosphere, security, ssl
> Fix For: 1.0.0
>
>
> Our SSL certificate validation currently assumes that the host (on connect 
> and on accept) has a valid hostname. This, however, is not true for all 
> environments.
> {{process::network::openssl::verify}} currently only allows the validation of 
> a certificate against a hostname. 
> See 
> https://github.com/apache/mesos/blob/08866edd8a71d12f87f4f4dbefa292729efbf6ae/3rdparty/libprocess/src/openssl.cpp#L546
> RFC 2818, however, says that it should be perfectly valid to validate a 
> certificate based on the IP address.
> See https://tools.ietf.org/html/rfc2818
> {noformat}
> In some cases, the URI is specified as an IP address rather than a
> hostname. In this case, the iPAddress subjectAltName must be present
> in the certificate and must exactly match the IP in the URI.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5495) Implement GET_WEIGHTS Call in v1 master API.

2016-06-28 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352450#comment-15352450
 ] 

Vinod Kone commented on MESOS-5495:
---

Author: zhou xing 
Date:   Mon Jun 27 23:26:26 2016 -0700

Renamed method 'getWeights' to 'get' in WeightsHandler.

Review: https://reviews.apache.org/r/49293/



> Implement GET_WEIGHTS Call in v1 master API.
> 
>
> Key: MESOS-5495
> URL: https://issues.apache.org/jira/browse/MESOS-5495
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: zhou xing
> Fix For: 1.0.0
>
>
> Review Request: 
> https://reviews.apache.org/r/48924/
> &
> https://reviews.apache.org/r/48925/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5499) Implement RESERVE_RESOURCES Call in v1 master API.

2016-06-28 Thread Abhishek Dasgupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352443#comment-15352443
 ] 

Abhishek Dasgupta commented on MESOS-5499:
--

Additional patch to validate resources for reserve/unreserve call - 
RR: https://reviews.apache.org/r/49296/

> Implement RESERVE_RESOURCES Call in v1 master API.
> --
>
> Key: MESOS-5499
> URL: https://issues.apache.org/jira/browse/MESOS-5499
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Abhishek Dasgupta
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5718) Mesos UI shows "Task is in RUNNING status" but can't find it in the Mesos Agent.

2016-06-28 Thread chenqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352427#comment-15352427
 ] 

chenqiang edited comment on MESOS-5718 at 6/28/16 6:17 AM:
---

Yes, it hung with the executor terminated. We upgraded the Mesos agent to 
0.28.2; after upgrading and starting mesos-slave.service, the running executors 
from the old version would recover once the Mesos agent registered again. 


was (Author: chenqiang):
yes, it was hung in executor terminated. 

> Mesos UI shows "Taks is in RUNNING status" but can't find it in the mesos 
> Agent.
> 
>
> Key: MESOS-5718
> URL: https://issues.apache.org/jira/browse/MESOS-5718
> Project: Mesos
>  Issue Type: Bug
>Reporter: chenqiang
>
> We have found an issue where a task launched by Marathon with a Docker 
> container shows "Task is in RUNNING status" in the Mesos UI, but cannot be 
> found on the Mesos agent host. Namely, the Docker container doesn't exist, 
> but the task is shown as RUNNING in the Mesos UI.
> Part of the log is attached below:
> ```
> I0627 14:31:30.239467  3913 slave.cpp:1912] Asked to kill task 
> tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 of 
> framework 20141201-145651-1900714250-5050-3484-
> W0627 14:31:30.239547  3913 slave.cpp:2025] Ignoring kill task 
> tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 
> because the executor 
> 'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' 
> of framework 20141201-145651-1900714250-5050-3484- at 
> executor(1)@10.153.96.22:14578 is terminating/terminated
> I0624 14:46:04.398646  3921 slave.cpp:4511] Sending reconnect request to 
> executor 
> 'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' 
> of framework 20141201-145651-1900714250-5050-3484- at 
> executor(1)@10.153.96.22:14578
> I0624 14:46:06.399073  3899 slave.cpp:2991] Killing un-reregistered executor 
> 'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' 
> of framework 20141201-145651-1900714250-5050-3484- at 
> executor(1)@10.153.96.22:14578
> I0624 14:46:06.399183  3899 slave.cpp:4571] Finished recovery
> I0624 14:46:06.399375  3902 docker.cpp:1724] Destroying container 
> 'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'
> I0624 14:46:06.399431  3902 docker.cpp:1852] Running docker stop on container 
> 'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'
> ``` 
> What's the root cause? It seems the executor of that task has terminated, but 
> the kill of the task is ignored by the slave.
> FIX: After restarting mesos-slave, the RUNNING task goes to FAILED status, 
> and we can see it launched again on another agent; the task returns to 
> normal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5718) Mesos UI shows "Task is in RUNNING status" but can't find it in the Mesos Agent.

2016-06-28 Thread chenqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352427#comment-15352427
 ] 

chenqiang commented on MESOS-5718:
--

Yes, it hung with the executor terminated. 

> Mesos UI shows "Taks is in RUNNING status" but can't find it in the mesos 
> Agent.
> 
>
> Key: MESOS-5718
> URL: https://issues.apache.org/jira/browse/MESOS-5718
> Project: Mesos
>  Issue Type: Bug
>Reporter: chenqiang
>
> We have found an issue where a task launched by Marathon with a Docker 
> container shows "Task is in RUNNING status" in the Mesos UI, but cannot be 
> found on the Mesos agent host. Namely, the Docker container doesn't exist, 
> but the task is shown as RUNNING in the Mesos UI.
> Part of the log is attached below:
> ```
> I0627 14:31:30.239467  3913 slave.cpp:1912] Asked to kill task 
> tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 of 
> framework 20141201-145651-1900714250-5050-3484-
> W0627 14:31:30.239547  3913 slave.cpp:2025] Ignoring kill task 
> tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799 
> because the executor 
> 'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' 
> of framework 20141201-145651-1900714250-5050-3484- at 
> executor(1)@10.153.96.22:14578 is terminating/terminated
> I0624 14:46:04.398646  3921 slave.cpp:4511] Sending reconnect request to 
> executor 
> 'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' 
> of framework 20141201-145651-1900714250-5050-3484- at 
> executor(1)@10.153.96.22:14578
> I0624 14:46:06.399073  3899 slave.cpp:2991] Killing un-reregistered executor 
> 'tanmenggang.router-web.jylt-online02.532b8817-391f-11e6-93b3-56847afe9799' 
> of framework 20141201-145651-1900714250-5050-3484- at 
> executor(1)@10.153.96.22:14578
> I0624 14:46:06.399183  3899 slave.cpp:4571] Finished recovery
> I0624 14:46:06.399375  3902 docker.cpp:1724] Destroying container 
> 'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'
> I0624 14:46:06.399431  3902 docker.cpp:1852] Running docker stop on container 
> 'fa37fc7c-7ef1-478a-81a2-cae38ab3e4cb'
> ``` 
> What's the root cause? It seems the executor of that task has terminated, but 
> the kill of the task is ignored by the slave.
> FIX: After restarting mesos-slave, the RUNNING task goes to FAILED status, 
> and we can see it launched again on another agent; the task returns to 
> normal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)