[jira] [Created] (MESOS-7804) Update the agent reservation table to show per-role allocation information.

2017-07-17 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7804:
--

 Summary: Update the agent reservation table to show per-role 
allocation information.
 Key: MESOS-7804
 URL: https://issues.apache.org/jira/browse/MESOS-7804
 Project: Mesos
  Issue Type: Improvement
  Components: webui
Reporter: Benjamin Mahler


MESOS-6441 introduces a reservation table to the agent page of the webui, but 
this does not show users to which roles the resources are allocated.

The table can be updated to display the allocation roles (see mockup 
[here|https://docs.google.com/spreadsheets/d/19I3gNn5SvRcQLp2Th5yHGvIfFLfZTM1PWkoEQDuuGG4/edit#gid=0]).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-07-17 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090881#comment-16090881
 ] 

Yan Xu commented on MESOS-6223:
---

https://reviews.apache.org/r/60925/

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>
> An agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent when it reboots anyway, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master and reconcile 
> the lost executors. This is a pre-requisite for supporting 
> persistent/restartable tasks in mesos (MESOS-3545).





[jira] [Commented] (MESOS-6390) Ensure Python support scripts are linted

2017-07-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090878#comment-16090878
 ] 

Joseph Wu commented on MESOS-6390:
--

Another chunk: 
{code}
commit c0e51a8d729aa6b06d70fa18259354536bc59b43
Author: Armand Grillet 
Date:   Mon Jul 17 11:16:23 2017 -0700

Linted support/post-reviews.py.

This will allow us to use PyLint on the
entire support directory in the future.

Review: https://reviews.apache.org/r/60233/

commit a536d101a52e21b32cb584d4df9d468e8b76b6a2
Author: Armand Grillet 
Date:   Mon Jul 17 11:41:55 2017 -0700

Linted support/push-commits.py.

This will allow us to use PyLint on the
entire support directory in the future.

Review: https://reviews.apache.org/r/60234/

commit 92896871bdb9f02066a403da789cdb70e6e496e4
Author: Armand Grillet 
Date:   Mon Jul 17 16:33:56 2017 -0700

Linted support/verify-reviews.py.

This will allow us to use PyLint on the
entire support directory in the future.

Review: https://reviews.apache.org/r/60236/
{code}

> Ensure Python support scripts are linted
> 
>
> Key: MESOS-6390
> URL: https://issues.apache.org/jira/browse/MESOS-6390
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Assignee: Armand Grillet
>  Labels: newbie, python
>
> Currently {{support/mesos-style.py}} does not lint files under {{support/}}. 
> This is mostly due to the fact that these scripts are too inconsistent 
> style-wise that they wouldn't even pass the linter now.
> We should clean up all Python scripts under {{support/}} so they pass the 
> Python linter, and activate that directory in the linter for future 
> additions. 





[jira] [Assigned] (MESOS-7786) Create a special "no sender" PID.

2017-07-17 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-7786:
-

Assignee: (was: Yan Xu)

> Create a special "no sender" PID.
> -
>
> Key: MESOS-7786
> URL: https://issues.apache.org/jira/browse/MESOS-7786
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Yan Xu
>
> In libprocess we have this "fire and forget" messaging semantics with 
> [process::post|https://github.com/apache/mesos/blob/5645fddf07eab0fcd195a3514e9e5ad3ef25628d/3rdparty/libprocess/include/process/process.hpp#L626-L637]
>  (e.g., MESOS-7753 and a few places in tests).
> Right now these kinds of messages use an empty sender PID, i.e., 
> {{@0.0.0.0:0}}, which could fail some checks like MESOS-7401.
> At a high level I think 
> - It's OK if libprocess messages don't have a sender actor, akin to the 
> [no-sender|http://doc.akka.io/japi/akka/current/akka/actor/ActorRef.html#noSender--]
>  concept in Akka.
> - However all remote messages should have a valid source address i.e., ip and 
> port in its {{Libprocess-From}}.
> So we can create a special *no-sender* ID and just make sure real actors 
> can't use this ID. The ID could be {{\_\_no\_sender\_\_}} or simply {{\_}}? The 
> UPID would then look like {{_@192.168.1.2:5050}}.
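
A minimal sketch of the proposal above, written in Python rather than libprocess's C++ to keep it short. The names `UPID`, `parse_upid`, `NO_SENDER_ID` and the helper functions are hypothetical and are not libprocess API; `_` is simply one of the two candidate IDs floated in the description.

```python
from typing import NamedTuple

# Hypothetical reserved ID; the ticket suggests "__no_sender__" or "_".
NO_SENDER_ID = "_"


class UPID(NamedTuple):
    id: str
    ip: str
    port: int


def parse_upid(text: str) -> UPID:
    """Split an 'id@ip:port' string into its three parts."""
    ident, _, address = text.partition("@")
    ip, _, port = address.rpartition(":")
    return UPID(ident, ip, int(port))


def is_no_sender(upid: UPID) -> bool:
    """True if the message was sent without a sender actor."""
    return upid.id == NO_SENDER_ID


def has_valid_source(upid: UPID) -> bool:
    """True if the UPID carries a usable source address (not 0.0.0.0:0)."""
    return upid.ip not in ("", "0.0.0.0") and upid.port != 0


def can_spawn_with_id(ident: str) -> bool:
    """Real actors must not be allowed to claim the reserved ID."""
    return ident != NO_SENDER_ID
```

With this split, a receiver could reject `@0.0.0.0:0` on the address check alone while still accepting `_@192.168.1.2:5050` as a sender-less but well-addressed message.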





[jira] [Commented] (MESOS-3435) Add containerizer support for hyper

2017-07-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090776#comment-16090776
 ] 

Deshi Xiao commented on MESOS-3435:
---

This issue is out of date. [~haosd...@gmail.com], do you have any update on it? 
Should we close it?

> Add containerizer support for hyper
> ---
>
> Key: MESOS-3435
> URL: https://issues.apache.org/jira/browse/MESOS-3435
> Project: Mesos
>  Issue Type: Story
>Reporter: Deshi Xiao
>Assignee: haosdent
>
> As secure as a hypervisor, as fast and easy to use as Docker: this is Hyper. 
> https://docs.hyper.sh/Introduction/what_is_hyper_.html We could implement 
> this as a module once MESOS-3709 is finished.





[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2017-07-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090744#comment-16090744
 ] 

Deshi Xiao commented on MESOS-4812:
---

any update?

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check, mesosphere, tech-debt
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.
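
To illustrate the quoting problem described above (in Python, as an illustration only and not how Mesos builds its command line): naively wrapping a command in `sh -c "..."` breaks as soon as the command itself contains double quotes, whereas `shlex.quote` from the standard library yields a single shell-safe argument. The example command and the two wrapper functions are made up for this sketch.

```python
import shlex


def wrap_naive(command: str) -> str:
    """Wrap the command in double quotes without escaping (the buggy form)."""
    return 'sh -c "' + command + '"'


def wrap_quoted(command: str) -> str:
    """Wrap the command so the shell sees it as exactly one argument."""
    return "sh -c " + shlex.quote(command)


command = 'echo "hello world"'  # made-up example command

# shlex.split mimics how a POSIX shell would tokenize each line.
naive_args = shlex.split(wrap_naive(command))
quoted_args = shlex.split(wrap_quoted(command))
```

Only `quoted_args` hands the original command to `sh -c` unchanged; in the naive form the inner quotes terminate the outer ones and the command is split into different words.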





[jira] [Commented] (MESOS-7067) Add the `OnTerminationPolicy` to the TaskInfo protobuf.

2017-07-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090727#comment-16090727
 ] 

Deshi Xiao commented on MESOS-7067:
---

Has this change been discarded? Is there any update on it?

> Add the `OnTerminationPolicy` to the TaskInfo protobuf.
> ---
>
> Key: MESOS-7067
> URL: https://issues.apache.org/jira/browse/MESOS-7067
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> As outlined in the [design doc | 
> https://docs.google.com/document/d/1VxfoZ-DzMHnKY0gzoccHEhx1rvdC2-RATJfJUfiAwGY/edit?usp=sharing]
>  , we need to introduce the {{OnTerminationPolicy}} to the {{TaskInfo}} 
> protobuf allowing every task to specify what an executor should do upon task 
> termination. 
> Note that this issue won't introduce the {{RestartPolicy}} message; that 
> will be added via a separate issue.





[jira] [Commented] (MESOS-7305) Adjust the recover logic of MesosContainerizer to allow standalone containers.

2017-07-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090724#comment-16090724
 ] 

Deshi Xiao commented on MESOS-7305:
---

Where should work on this issue start?

> Adjust the recover logic of MesosContainerizer to allow standalone containers.
> --
>
> Key: MESOS-7305
> URL: https://issues.apache.org/jira/browse/MESOS-7305
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
>
> The current recovery logic in MesosContainerizer assumes that all top level 
> containers are tied to some Mesos executors. Adding standalone containers 
> will invalidate this assumption. The recovery logic must be changed to adapt to 
> that.





[jira] [Updated] (MESOS-6223) Allow agents to re-register post a host reboot

2017-07-17 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-6223:
--
Shepherd: Yan Xu  (was: Vinod Kone)

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>
> An agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent when it reboots anyway, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master and reconcile 
> the lost executors. This is a pre-requisite for supporting 
> persistent/restartable tasks in mesos (MESOS-3545).





[jira] [Updated] (MESOS-6549) Asynchronous dir removal in agent GC

2017-07-17 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-6549:
--
Shepherd: Yan Xu
Target Version/s: 1.4.0

> Asynchronous dir removal in agent GC
> 
>
> Key: MESOS-6549
> URL: https://issues.apache.org/jira/browse/MESOS-6549
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jacob Janco
>Assignee: Jacob Janco
>  Labels: gc
>
> In src/slave/gc.cpp: 
>   // TODO(bmahler): Other dispatches can block waiting for a removal
>   // operation. To fix this, the removal operation can be done
>   // asynchronously in another thread.
> We did see this occur in our clusters where rmdir operations can take seconds 
> to complete, blocking other queued events leading to, for example, long 
> latencies to task launch. 
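
The idea in the TODO above, moving removal off the thread that handles other events, can be sketched in Python as follows. This is an illustration of the pattern only, not the agent's GC code; `remove_async` and `_removal_pool` are hypothetical names.

```python
import os
import shutil
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor

# A single worker keeps removals ordered while staying off the caller's thread.
_removal_pool = ThreadPoolExecutor(max_workers=1)


def remove_async(path: str) -> Future:
    """Schedule a recursive removal and return immediately with a future."""
    return _removal_pool.submit(shutil.rmtree, path)


# Usage: the caller is free to handle other events until it needs the result.
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, "file"), "w").close()
pending = remove_async(scratch)
pending.result(timeout=30)  # Blocks only here; re-raises if the removal failed.
removed = not os.path.exists(scratch)
```

The caller only waits if and when it asks for the result, so a slow removal no longer delays unrelated queued work.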





[jira] [Created] (MESOS-7802) Push-commits.py support script is too lenient when determining reviews to close

2017-07-17 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-7802:


 Summary: Push-commits.py support script is too lenient when 
determining reviews to close
 Key: MESOS-7802
 URL: https://issues.apache.org/jira/browse/MESOS-7802
 Project: Mesos
  Issue Type: Bug
Reporter: Joseph Wu
Priority: Minor


The support script {{support/push-commits.py}} can be used by committers to 
push commits and simultaneously close reviews.  However, it is currently quite 
easy to trick the script into closing unrelated reviews.

For example, if you have a commit message like:
{code}
Referring to multiple reviews in one commit message.

Review: https://reviews.apache.org/r/1/
Review: https://reviews.apache.org/r/2/
Review: https://reviews.apache.org/r/3/
Review: https://reviews.apache.org/r/4/
{code}

The script will do this:
{code}
$ support/push-commits.py --dry-run
Found reviews ['1', '2', '3', '4']
Pushing commits to apache
Closing review 1
Closing review 2
Closing review 3
Closing review 4
{code}

It is possible for this to happen non-maliciously, if the contributor's review 
description merely refers to another review in the same format.
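
One possible way to tighten the matching, sketched here under the assumption that only a `Review:` line in the final position of the commit message identifies the review to close. This is not the current behaviour of {{support/push-commits.py}}, and `review_to_close` is a hypothetical helper name.

```python
import re

# Matches only a whole line of the form
# "Review: https://reviews.apache.org/r/<id>/".
REVIEW_LINE = re.compile(r"^Review: https://reviews\.apache\.org/r/(\d+)/?$")


def review_to_close(commit_message: str):
    """Return the review ID from the final line of the message, or None."""
    lines = [line.strip() for line in commit_message.strip().splitlines()]
    if not lines:
        return None
    match = REVIEW_LINE.match(lines[-1])
    return match.group(1) if match else None
```

Under this rule, a `Review:` line that appears earlier in the description is ignored, so the four-review message above would close at most review 4.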





[jira] [Commented] (MESOS-7780) Add `SUBSCRIBE` call handling to the resource provider manager

2017-07-17 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090613#comment-16090613
 ] 

Jie Yu commented on MESOS-7780:
---

commit 3c264284e5ae00d431813d015d5f459d01b4e9d0
Author: Jan Schlicht 
Date:   Mon Jul 17 14:38:07 2017 -0700

Added resource provider ID to calls.

The resource provider manager needs to know from which resource provider
calls originated.

Review: https://reviews.apache.org/r/60768/

> Add `SUBSCRIBE` call handling to the resource provider manager
> --
>
> Key: MESOS-7780
> URL: https://issues.apache.org/jira/browse/MESOS-7780
> Project: Mesos
>  Issue Type: Task
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: storage
>
> Resource providers will use the HTTP API to subscribe to the 
> {{ResourceProviderManager}}. Handling these calls needs to be implemented. On 
> subscription, a unique resource provider ID will be assigned to the resource 
> provider and a {{SUBSCRIBED}} event will be sent.





[jira] [Assigned] (MESOS-7492) Introduce a daemon manager in the agent.

2017-07-17 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-7492:
-

Assignee: Joseph Wu

> Introduce a daemon manager in the agent.
> 
>
> Key: MESOS-7492
> URL: https://issues.apache.org/jira/browse/MESOS-7492
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Joseph Wu
>  Labels: mesosphere, storage
>
> Once we have standalone container support from the containerizer, we should 
> consider adding a daemon manager inside the agent. It'll be like 'monit', 
> 'upstart' or 'systemd', but with very limited functionality. For instance, 
> as a start, the manager will simply always restart a daemon if it 
> fails. It'll also try to clean up unknown daemons.
> This feature will be used to manage CSI plugin containers on the agent.
> The daemon manager should have an interface allowing operators to "register" 
> a daemon with a name and a config of the daemon. The daemon manager is 
> responsible for restarting the daemon if it crashes until someone explicitly 
> "unregisters" it. Some simple backoff and health check functionality should be 
> provided.
> We probably need a small design doc for this.
> {code}
> message DaemonConfig {
>   optional ContainerInfo container;
>   optional CommandInfo command;
>   optional uint32 poll_interval;
>   optional uint32 initial_delay;
>   optional CheckInfo check; // For health check.
> }
> class DaemonManager
> {
> public:
>   Future register(
> const ContainerID& containerId,
> const DaemonConfig& config);
>   Future unregister(const ContainerID& containerId);
>   Future ps();
>   Future status(const ContainerID& containerId);
> };
> {code}





[jira] [Commented] (MESOS-7753) `log.LearnedMessage` could be rejected due to being sent from '@0.0.0.0:0'

2017-07-17 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16090238#comment-16090238
 ] 

Yan Xu commented on MESOS-7753:
---

Eventually decided to address this ticket separately from MESOS-7786. Will just 
use the pid of the {{NetworkProcess}} as the sender.

> `log.LearnedMessage` could be rejected due to being sent from '@0.0.0.0:0'
> --
>
> Key: MESOS-7753
> URL: https://issues.apache.org/jira/browse/MESOS-7753
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> This is due to the use of 
> https://github.com/apache/mesos/blob/ced7d69767142c912db0c2e01b95c0f5de791bd9/src/log/network.hpp#L247
>  which sets the message's {{from}} field to an empty UPID.
> The rationale for using this form is that there's no intended response for 
> this message. However with MESOS-7401, this message could be rejected.





[jira] [Assigned] (MESOS-7753) `log.LearnedMessage` could be rejected due to being sent from '@0.0.0.0:0'

2017-07-17 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-7753:
-

Assignee: Yan Xu

> `log.LearnedMessage` could be rejected due to being sent from '@0.0.0.0:0'
> --
>
> Key: MESOS-7753
> URL: https://issues.apache.org/jira/browse/MESOS-7753
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> This is due to the use of 
> https://github.com/apache/mesos/blob/ced7d69767142c912db0c2e01b95c0f5de791bd9/src/log/network.hpp#L247
>  which sets the message's {{from}} field to an empty UPID.
> The rationale for using this form is that there's no intended response for 
> this message. However with MESOS-7401, this message could be rejected.





[jira] [Updated] (MESOS-7753) `log.LearnedMessage` could be rejected due to being sent from '@0.0.0.0:0'

2017-07-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-7753:
-
Shepherd: Joseph Wu

> `log.LearnedMessage` could be rejected due to being sent from '@0.0.0.0:0'
> --
>
> Key: MESOS-7753
> URL: https://issues.apache.org/jira/browse/MESOS-7753
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Yan Xu
>
> This is due to the use of 
> https://github.com/apache/mesos/blob/ced7d69767142c912db0c2e01b95c0f5de791bd9/src/log/network.hpp#L247
>  which sets the message's {{from}} field to an empty UPID.
> The rationale for using this form is that there's no intended response for 
> this message. However with MESOS-7401, this message could be rejected.





[jira] [Created] (MESOS-7801) Retry logic for unsuccessful `docker rm` during agent recovery

2017-07-17 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-7801:
--

 Summary: Retry logic for unsuccessful `docker rm` during agent 
recovery
 Key: MESOS-7801
 URL: https://issues.apache.org/jira/browse/MESOS-7801
 Project: Mesos
  Issue Type: Improvement
  Components: docker
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
 Fix For: 1.4.0


In MESOS- we skip the failure when `docker rm` fails due to mount leakage 
during agent recovery. In order not to leave residual docker containers in the 
docker daemon, we could do a best-effort `docker rm` retry with an exponential 
backoff since we cannot control when the leakage would be terminated.
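
A best-effort retry with exponential backoff could look roughly like the Python sketch below. It illustrates the pattern only; the function names, the delay parameters and the boolean `operation` callback (standing in for a `docker rm` attempt) are assumptions, not the agent's implementation.

```python
import time
from typing import Callable, Iterable, Iterator


def backoff_delays(
    initial: float, factor: float, maximum: float, attempts: int
) -> Iterator[float]:
    """Yield the wait before each retry: initial, initial*factor, ...

    Each delay is capped at `maximum`.
    """
    delay = initial
    for _ in range(attempts):
        yield min(delay, maximum)
        delay *= factor


def retry_best_effort(
    operation: Callable[[], bool],
    delays: Iterable[float],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once operation succeeds, False when the delays run out."""
    if operation():
        return True
    for delay in delays:
        sleep(delay)
        if operation():
            return True
    return False
```

Because the retry gives up quietly once the delays are exhausted, recovery itself is never blocked on the removal succeeding.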





[jira] [Updated] (MESOS-7711) Master updates registry for reregistering agents even when they haven't been unreachable

2017-07-17 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7711:
--
Shepherd: James Peach

> Master updates registry for reregistering agents even when they haven't been 
> unreachable
> 
>
> Key: MESOS-7711
> URL: https://issues.apache.org/jira/browse/MESOS-7711
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> During a master failover we observed many registry updates, on average _one 
> per two agents_, as indicated by the log line 
> {noformat:title=}
> I0609 04:46:25.220196 48864 registrar.cpp:550] Successfully updated the 
> registry in 42.904064ms
> {noformat}
> [code|https://github.com/apache/mesos/blob/19a6134d03141dc2cb073a904378c2c129b5138d/src/master/registrar.cpp#L550]
> In this case few agents were ever unreachable, so most of these updates are 
> redundant. 
> Associated with each registry update is also the time spent on applying the 
> operations
> {noformat:title=}
> I0609 04:46:26.475761 48897 registrar.cpp:493] Applied 1 operations in 
> 11.673082ms; attempting to update the registry
> {noformat}
> [code|https://github.com/apache/mesos/blob/19a6134d03141dc2cb073a904378c2c129b5138d/src/master/registrar.cpp#L493]
> Although this does not consume the Master actor's time, all agent 
> reregistrations are guarded and delayed by these operations. This could 
> be easily avoided by checking the {{slaves.recovered}} field in 
> {{Master}}.





[jira] [Created] (MESOS-7800) Tasks with many labels can cause disproportionally huge allocations

2017-07-17 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-7800:
---

 Summary: Tasks with many labels can cause disproportionally huge 
allocations
 Key: MESOS-7800
 URL: https://issues.apache.org/jira/browse/MESOS-7800
 Project: Mesos
  Issue Type: Bug
  Components: agent, master
Reporter: Benjamin Bannier


{{mesos.proto}} provides the {{Labels}} message so others can add free-form 
data to a number of messages. In e.g., {{TaskInfo}} and {{ExecutorInfo}} we 
explicitly document
{quote}
Therefore, labels should be used to tag tasks with light-weight meta-data.
{quote}
We however never enforce this requirement.

This becomes problematic in, e.g., the agent, where a {{TaskInfo}} will likely be 
copied often, e.g., due to multiple levels of dispatches. I have measured that 
a single {{Label}} can trigger 50-100 concurrent copies in flight on the 
agent's container launch path; our general assumption here seems to be that 
while a {{TaskInfo}} is not necessarily small, it still is not huge.

If users embed a lot of data into e.g., {{TaskInfo}} {{labels}} this can lead 
to a temporary explosion of the agent process' memory footprint which can lead 
to it being killed by the OS.

Due to the potential negative effects of huge {{labels}} we should evaluate how 
we can limit the amount of data we accept from users. This could mean limiting 
the size of {{TaskInfo}} or {{Labels}} we accept, measured e.g., by the 
message's {{ByteSizeLong}}. It seems that a value somehow related to 
{{ARG_MAX}} would be intuitive, but I am not sure if we can go as low as the 
POSIX-mandated minimum requirement of 4096.
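
As a rough sketch of such a limit (in Python, with labels simplified to a key/value mapping): the check below sums the UTF-8 size of all keys and values, which only approximates a protobuf {{ByteSizeLong}} because it ignores field framing. The 4096 value and the function names are placeholders for illustration, not a proposed default.

```python
from typing import Dict, Optional

# Hypothetical limit; the ticket leaves the actual value open.
MAX_LABELS_BYTES = 4096


def labels_size(labels: Dict[str, Optional[str]]) -> int:
    """Approximate payload size: UTF-8 bytes of every key and value."""
    total = 0
    for key, value in labels.items():
        total += len(key.encode("utf-8"))
        if value is not None:
            total += len(value.encode("utf-8"))
    return total


def validate_labels(
    labels: Dict[str, Optional[str]], limit: int = MAX_LABELS_BYTES
) -> Optional[str]:
    """Return an error string if the labels exceed the limit, else None."""
    size = labels_size(labels)
    if size > limit:
        return f"labels take {size} bytes, which exceeds the limit of {limit}"
    return None
```

Rejecting oversized labels at validation time would bound the cost of every later copy of the message.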





[jira] [Commented] (MESOS-7792) Add support for ECDH ciphers

2017-07-17 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089970#comment-16089970
 ] 

Alexander Rojas commented on MESOS-7792:


[r/60913/|https://reviews.apache.org/r/60913/]: Adds support for OpenSSL's ECDH 
handshake.

> Add support for ECDH ciphers
> 
>
> Key: MESOS-7792
> URL: https://issues.apache.org/jira/browse/MESOS-7792
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.3.0
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>
> [Elliptic curve 
> ciphers|https://wiki.openssl.org/index.php/Elliptic_Curve_Cryptography] are a 
> family of ciphers supported by OpenSSL. They allow smaller keys, but 
> require an extra configuration parameter, the actual curve to be used, which 
> can't be done through libprocess as it is.





[jira] [Updated] (MESOS-7792) Add support for ECDH ciphers

2017-07-17 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas updated MESOS-7792:
---
Sprint: Mesosphere Sprint 59

> Add support for ECDH ciphers
> 
>
> Key: MESOS-7792
> URL: https://issues.apache.org/jira/browse/MESOS-7792
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.3.0
>Reporter: Alexander Rojas
>Assignee: Alexander Rojas
>
> [Elliptic curve 
> ciphers|https://wiki.openssl.org/index.php/Elliptic_Curve_Cryptography] are a 
> family of ciphers supported by OpenSSL. They allow smaller keys, but 
> require an extra configuration parameter, the actual curve to be used, which 
> can't be done through libprocess as it is.





[jira] [Commented] (MESOS-6933) Executor does not respect grace period

2017-07-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089848#comment-16089848
 ] 

Deshi Xiao commented on MESOS-6933:
---

[~janisz], can you write a test to cover it? I have no clue where in the code 
to start fixing this.

> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
> Attachments: 屏幕快照 2017-07-17 下午2.19.03.png
>
>
> The Mesos command executor tries to support a grace period with escalation, but 
> unfortunately it does not work. It launches {{command}} by wrapping it in 
> {{sh -c}}, which causes the process tree to look like this:
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to exit immediately, and the executor with it, while the wrapped 
> {{command}} might need some more time to finish. Finally, the executor thinks 
> the command exited gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This causes leaks when the POSIX containerizer is used, because if the command ignores 
> SIGTERM it will be reparented to init and never get killed. Using 
> pid/namespace only masks the problem because the hanging process is captured 
> before it can shut down gracefully.
> The fix for this is to send SIGTERM only to the children of {{sh}}. {{sh}} will exit 
> when all child processes finish. If they do not, they will be killed by escalation 
> to SIGKILL.
> All versions from 0.20 are affected.
> This test should pass 
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list 
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]
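
The proposed fix, signalling the children of {{sh}} rather than {{sh}} itself, can be sketched in Python over a pid-to-parent-pid table. The function names are hypothetical, and the fallback to the shell's own pid when it has no children is an assumption of this sketch, not something the description states.

```python
from typing import Dict, List


def children_of(ppid_by_pid: Dict[int, int], parent: int) -> List[int]:
    """Return the direct children of `parent`, sorted by pid."""
    return sorted(pid for pid, ppid in ppid_by_pid.items() if ppid == parent)


def sigterm_targets(ppid_by_pid: Dict[int, int], shell_pid: int) -> List[int]:
    """Pick the pids to signal.

    Prefer the shell's children; fall back to the shell itself if it has
    none (an assumption of this sketch).
    """
    children = children_of(ppid_by_pid, shell_pid)
    return children if children else [shell_pid]
```

For the process tree in the log above (pid 18 is the shell, pid 19 the command), only pid 19 would be selected; actually delivering the signal could then use `os.kill(pid, signal.SIGTERM)`.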





[jira] [Updated] (MESOS-7352) Improve performance of task checks.

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7352:
---
Labels: check health-check mesosphere performance  (was: check health-check 
mesosphere)

> Improve performance of task checks.
> ---
>
> Key: MESOS-7352
> URL: https://issues.apache.org/jira/browse/MESOS-7352
> Project: Mesos
>  Issue Type: Epic
>  Components: agent, executor
>Reporter: Alexander Rukletsov
>  Labels: check, health-check, mesosphere, performance
>
> This epic aims to improve performance of checks and health checks in built-in 
> executors.





[jira] [Updated] (MESOS-6972) Improve performance of protobuf message passing by removing RepeatedPtrField to vector conversion.

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6972:
---
Labels: performance tech-debt  (was: tech-debt)

> Improve performance of protobuf message passing by removing RepeatedPtrField 
> to vector conversion.
> --
>
> Key: MESOS-6972
> URL: https://issues.apache.org/jira/browse/MESOS-6972
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: performance, tech-debt
>
> Currently, all protobuf message handlers must take a {{vector}} for repeated 
> fields, rather than a {{RepeatedPtrField}}.
> This requires that a copy be performed of the repeated field's entries (see 
> [here|https://github.com/apache/mesos/blob/9228ebc239dac42825390bebc72053dbf3ae7b09/3rdparty/libprocess/include/process/protobuf.hpp#L78-L87]),
>  which can be very expensive in some cases. We should avoid requiring this 
> expense on the callers.





[jira] [Commented] (MESOS-3915) Upgrade vendored Boost

2017-07-17 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089771#comment-16089771
 ] 

Alexander Rukletsov commented on MESOS-3915:


[~neilc], shall we close this since all dependent tickets are resolved or keep 
open because the upgrade has never happened?

> Upgrade vendored Boost
> --
>
> Key: MESOS-3915
> URL: https://issues.apache.org/jira/browse/MESOS-3915
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Priority: Minor
>  Labels: boost, mesosphere, tech-debt
>
> We should upgrade the vendored version of Boost to a newer version. Benefits:
> * -Should properly fix MESOS-688-
> * -Should fix MESOS-3799-
> * Generally speaking, using a more modern version of Boost means we can take 
> advantage of bug fixes, optimizations, and new features.





[jira] [Created] (MESOS-7799) Document unit for offer_timeout configuration value

2017-07-17 Thread Vincenzo Pii (JIRA)
Vincenzo Pii created MESOS-7799:
---

 Summary: Document unit for offer_timeout configuration value
 Key: MESOS-7799
 URL: https://issues.apache.org/jira/browse/MESOS-7799
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Affects Versions: 1.3.0
Reporter: Vincenzo Pii
Priority: Minor


The documentation doesn't mention the unit for the {{offer_timeout}} setting.

The best documentation that I could find was the source code and this answer on 
stackoverflow: 
https://stackoverflow.com/questions/39380903/how-to-set-mesos-master-offer-timeout-config-in-minimesos.

Affected file is {{docs/configuration.md}}.





[jira] [Commented] (MESOS-5123) Docker task may fail if path to agent work_dir is relative.

2017-07-17 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089764#comment-16089764
 ] 

Alexander Rukletsov commented on MESOS-5123:


[~jieyu], [~klausma1982], [~gilbert] is it still an issue?

> Docker task may fail if path to agent work_dir is relative. 
> 
>
> Key: MESOS-5123
> URL: https://issues.apache.org/jira/browse/MESOS-5123
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Klaus Ma
>  Labels: docker, documentation, mesosphere
>
> When a local folder for agent’s {{\-\-work_dir}} is specified (e.g., 
> {{\-\-work_dir=w/s}}) docker complains that there are forbidden symbols in a 
> *local* volume name. Specifying an absolute path (e.g., 
> {{\-\-work_dir=/tmp}}) solves the problem.
> Docker error observed:
> {noformat}
> docker: Error response from daemon: create 
> w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc:
>  volume name invalid: 
> "w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc"
>  includes invalid characters for a local volume name, only 
> "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed.
> {noformat}
> First off, it is not obvious that Mesos always creates a volume for the 
> sandbox. We may want to document it.
> Second, it's hard to understand that a relative {{work_dir}} can trigger a 
> forbidden-symbols error in Docker. Does it make sense to check for this during 
> agent launch if the Docker containerizer is enabled? Or to reject Docker tasks 
> during task validation?





[jira] [Updated] (MESOS-5078) Document TaskStatus reasons

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5078:
---
Story Points: 3  (was: 1)
  Labels: documentation mesosphere newbie++  (was: documentation 
mesosphere)

> Document TaskStatus reasons
> ---
>
> Key: MESOS-5078
> URL: https://issues.apache.org/jira/browse/MESOS-5078
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Greg Mann
>  Labels: documentation, mesosphere, newbie++
>
> We should document the possible {{reason}} values that can be found in the 
> {{TaskStatus}} message.





[jira] [Updated] (MESOS-5864) Document MESOS_SANDBOX executor env variable.

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5864:
---
Labels: containerizer documentation mesosphere  (was: containerizer 
docuentation mesosphere)

> Document MESOS_SANDBOX executor env variable.
> -
>
> Key: MESOS-5864
> URL: https://issues.apache.org/jira/browse/MESOS-5864
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, documentation
>Reporter: Jie Yu
>Assignee: Gilbert Song
>  Labels: containerizer, documentation, mesosphere
> Fix For: 1.1.0
>
>
> And we should document the difference with MESOS_DIRECTORY.





[jira] [Updated] (MESOS-7454) Document how to use `cgroups/devices`

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7454:
---
Labels: documentation  (was: docuentation)

> Document how to use `cgroups/devices`
> -
>
> Key: MESOS-7454
> URL: https://issues.apache.org/jira/browse/MESOS-7454
> Project: Mesos
>  Issue Type: Bug
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>  Labels: documentation
>






[jira] [Updated] (MESOS-5422) Website README.md is out of dated

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5422:
---
Labels: documentation  (was: docuentation)

> Website README.md is out of dated
> -
>
> Key: MESOS-5422
> URL: https://issues.apache.org/jira/browse/MESOS-5422
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>  Labels: documentation
> Fix For: 1.0.0
>
>
> {quote}
> Tomek Janiszewski via mesos.apache.org 
> 10:15 PM (32 minutes ago)
> to dev 
> Hi
> I think website readme 
> is out of date.
> 1. It doesn't mention mesos-website-container
> 
> 2. support/generate-help-site.py does not exist
> Am I right? How to generate full site (with documentation and getting
> started section)?
> Thanks
> {quote}





[jira] [Updated] (MESOS-2934) Mesos master crashes when quorum set to 4

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-2934:
---
Labels: documentation  (was: documentaion)

> Mesos master crashes when quorum set to 4
> -
>
> Key: MESOS-2934
> URL: https://issues.apache.org/jira/browse/MESOS-2934
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 0.22.1
> Environment: CentOS 7
> Java 1.7.0_55
>Reporter: Craig W
>Priority: Minor
>  Labels: documentation
>
> When deploying 5 mesos masters, with quorum set to 4, the masters start up 
> but fail to stay running. Instead they exit and then restart (Monit is used 
> to supervise the process) within a few seconds. This cycle continues non-stop.
> The logs on the master look like this:
> {noformat}
> Received a recover response from a replica in EMPTY status
> Received a recover response from a replica in EMPTY status
> Replica in EMPTY status received a broadcasted recover request
> Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins
> Replica in EMPTY status received a broadcasted recover request
> Received a recover response from a replica in EMPTY status
> Received a recover response from a replica in EMPTY status
> Replica in EMPTY status received a broadcasted recover 
> The newly elected leader is master@:5050 with id 
> 20150625-102436-748881418-5050-2157
> Elected as the leading master!
> Recovering from registrar
> Recovering registrar
> Unable to finish the recover protocol in 10secs, retrying
> Unable to finish the recover protocol in 10secs, retrying
> Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins
> {noformat}
> When I change the quorum to 2 and run just 3 mesos master processes, the 
> cluster stays up without a hitch.
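
For reference, the usual majority rule for a replicated quorum can be sketched as below. This assumes the quorum is meant to be a strict majority of the masters (floor(N/2) + 1), which matches the working setup in the report (quorum 2 with 3 masters). It does not explain the crash itself, and the function name majority_quorum is made up for illustration.

```python
def majority_quorum(num_masters):
    """Return the smallest quorum size that is a strict majority of num_masters."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    return num_masters // 2 + 1
```

Under this rule 5 masters would use a quorum of 3, so the reported quorum of 4 is larger than a simple majority requires.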





[jira] [Updated] (MESOS-3749) Configuration docs are missing --enable-libevent and --enable-ssl

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3749:
---
Labels: configuration documentation installation mesosphere  (was: 
configuration documentaion installation mesosphere)

> Configuration docs are missing --enable-libevent and --enable-ssl
> -
>
> Key: MESOS-3749
> URL: https://issues.apache.org/jira/browse/MESOS-3749
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 0.25.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: configuration, documentation, installation, mesosphere
> Fix For: 0.26.0
>
>
> The {{\-\-enable-libevent}} and {{\-\-enable-ssl}} config flags are currently 
> not documented in the "Configuration" docs with the rest of the flags. They 
> should be added.





[jira] [Updated] (MESOS-4317) Document use of mesos specific future design patterns in gmock test framework

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4317:
---
Labels: documentation mesosphere  (was: documentaion mesosphere)

> Document use of mesos specific future design patterns in gmock test framework
> -
>
> Key: MESOS-4317
> URL: https://issues.apache.org/jira/browse/MESOS-4317
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Avinash Sridharan
>Priority: Minor
>  Labels: documentation, mesosphere
>
> Mesos relies heavily on the Google Test and Google Mock frameworks for its unit 
> test infrastructure. In order to support unit testing of Mesos classes that 
> are inherently designed to be multi-threaded (or multi-process) and 
> asynchronous in nature, the libprocess future/promise design patterns have 
> been used to expose a set of APIs that allow for asynchronous callbacks within 
> the Mesos-specific gmock test framework 
> (3rdparty/libprocess/include/process/gmock.hpp). 
> Given that these future/promise-based APIs are very specific to the Apache 
> Mesos test framework, it would be good to have documentation about their 
> use cases to better inform developers (especially newbies) of this 
> infrastructure.





[jira] [Updated] (MESOS-4222) Document containerizer from user perspective.

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4222:
---
Labels: documentation mesosphere  (was: documentaion mesosphere)

> Document containerizer from user perspective.
> -
>
> Key: MESOS-4222
> URL: https://issues.apache.org/jira/browse/MESOS-4222
> Project: Mesos
>  Issue Type: Documentation
>  Components: containerization
>Reporter: Jojy Varghese
>Assignee: Jojy Varghese
>  Labels: documentation, mesosphere
>
> Add documentation that covers:
> * Purpose of containerizers from a use case perspective.
> * What purpose does each containerizer (mesos, docker, compose) serve.
> * What criteria could be used to choose a containerizer.





[jira] [Updated] (MESOS-1611) Getting Started Guide Missing Package

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1611:
---
Labels: documentation  (was: documentaion)

> Getting Started Guide Missing Package
> -
>
> Key: MESOS-1611
> URL: https://issues.apache.org/jira/browse/MESOS-1611
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 0.19.0
> Environment: Ubuntu 12.04 LTS
>Reporter: Daneyon Hansen
>Priority: Minor
>  Labels: documentation
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I was following the instructions from the getting started guide 
> (http://mesos.apache.org/gettingstarted/) and I could not build mesos without 
> installing python-dev.





[jira] [Updated] (MESOS-5216) Document docker volume driver isolator.

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5216:
---
Labels: documentation mesosphere  (was: documentaion mesosphere)

> Document docker volume driver isolator.
> ---
>
> Key: MESOS-5216
> URL: https://issues.apache.org/jira/browse/MESOS-5216
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Guangya Liu
>  Labels: documentation, mesosphere
> Fix For: 1.0.0
>
>
> Should include the following:
> 1. What features (driver options) are supported in docker volume driver 
> isolator.
> 2. How to use docker volume driver isolator.
> * related agent flags introduction and usage.
> * isolator dependency clarification (e.g., filesystem/linux).
> * related driver daemon preprocess.
> * volumes pre-specified by users and volume cleanup.





[jira] [Updated] (MESOS-1002) Add "make check" instruction to getting started doc

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1002:
---
Labels: documentation  (was: documentaion)

> Add "make check" instruction to getting started doc
> ---
>
> Key: MESOS-1002
> URL: https://issues.apache.org/jira/browse/MESOS-1002
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vishal Shah
>Assignee: Vinod Kone
>Priority: Trivial
>  Labels: documentation
> Fix For: 0.19.0
>
>
> Would be great to add "make check" to the getting started doc to help new 
> users build and run the test framework post install.
> https://mesos.apache.org/gettingstarted/
> Thanks to tillt on irc who helped me out. 





[jira] [Updated] (MESOS-7349) Document Mesos "check" feature.

2017-07-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-7349:
---
Labels: documentation mesosphere  (was: documentaion mesosphere)

> Document Mesos "check" feature.
> ---
>
> Key: MESOS-7349
> URL: https://issues.apache.org/jira/browse/MESOS-7349
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: documentation, mesosphere
>
> This should include recommendations for framework authors about how and when to 
> use general checks, as well as a comparison with health checks.





[jira] [Commented] (MESOS-6933) Executor does not respect grace period

2017-07-17 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089547#comment-16089547
 ] 

Tomasz Janiszewski commented on MESOS-6933:
---

I can reproduce it on the latest master: 
https://github.com/apache/mesos/commit/400d3002d4aa82cbae4b55bced608e95225176e4


> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
> Attachments: 屏幕快照 2017-07-17 下午2.19.03.png
>
>
> Mesos Command Executor tries to support a grace period with escalation, but 
> unfortunately it does not work. It launches {{command}} by wrapping it in 
> {{sh -c}}, which causes the process tree to look like this:
> {code}
> Received killTask
> Shutting down
> Sending SIGTERM to process tree at pid 18
> Sent SIGTERM to the following process trees:
> [ 
> -+- 18 sh -c cd offer-i18n-0.1.24 && LD_PRELOAD=../librealresources.so 
> ./bin/offer-i18n -e prod -p $PORT0 
>  \--- 19 command...
> ]
> Command terminated with signal Terminated (pid: 18)
> {code}
> This causes {{sh}} to exit immediately, and the executor with it, while the 
> wrapped {{command}} might need some more time to finish. As a result, the 
> executor thinks the command exited gracefully, so it won't 
> [escalate|https://github.com/apache/mesos/blob/1.1.0/src/launcher/executor.cpp#L695]
>  to SIGKILL.
> This causes leaks when the POSIX containerizer is used, because if the command 
> ignores SIGTERM it will be reattached to init and never get killed. Using 
> pid/namespace only masks the problem because the hanging process is captured 
> before it can shut down gracefully.
> The fix for this is to send SIGTERM only to the children of {{sh}}. {{sh}} will 
> exit when all child processes finish. If they do not, they will be killed by 
> escalation to SIGKILL.
> All versions from 0.20 are affected.
> This test should pass 
> [src/tests/command_executor_tests.cpp:342|https://github.com/apache/mesos/blob/2c856178b59593ff8068ea8d6c6593943c33008c/src/tests/command_executor_tests.cpp#L342-L343]
> [Mailing list 
> thread|https://lists.apache.org/thread.html/1025dca0cf4418aee50b14330711500af864f08b53eb82d10cd5c04c@%3Cuser.mesos.apache.org%3E]





[jira] [Commented] (MESOS-6933) Executor does not respect grace period

2017-07-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089452#comment-16089452
 ] 

Deshi Xiao commented on MESOS-6933:
---

[~janisz] Yes, I built on the upstream Mesos code base. It is 1.4.

> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
> Attachments: 屏幕快照 2017-07-17 下午2.19.03.png
>
>





[jira] [Commented] (MESOS-6933) Executor does not respect grace period

2017-07-17 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089403#comment-16089403
 ] 

Tomasz Janiszewski commented on MESOS-6933:
---

[~xds2000] On my setup it works differently. I'm on Mesos 1.3, and when I follow 
the steps described above I end up in a state where the task is reported as 
killed but is still running. In the logs you provided, I see that the Mesos 
executor somehow determined that not all processes had exited and sent a KILL 
signal to them. In my case it ends at SIGTERM:

{code}

Sent SIGTERM to the following process trees:
[
-+- 18776 sh -c /tmp/script.sh
 \-+- 18790 /bin/sh /tmp/script.sh
   \--- 18832 sleep 1
]
Scheduling escalation to SIGKILL in 3secs from now
Terminated
SIGNAL
Command terminated with signal Terminated (pid: 18776)
{code}

> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
> Attachments: 屏幕快照 2017-07-17 下午2.19.03.png
>
>





[jira] [Updated] (MESOS-6933) Executor does not respect grace period

2017-07-17 Thread Deshi Xiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deshi Xiao updated MESOS-6933:
--
Attachment: 屏幕快照 2017-07-17 下午2.19.03.png

Please check the screenshot, [~janisz].

> Executor does not respect grace period
> --
>
> Key: MESOS-6933
> URL: https://issues.apache.org/jira/browse/MESOS-6933
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Tomasz Janiszewski
> Attachments: 屏幕快照 2017-07-17 下午2.19.03.png
>
>


