[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389066#comment-16389066
 ] 

James Peach edited comment on MESOS-6918 at 3/7/18 6:01 AM:


{quote}
[~jamespeach], do you think it's feasible to target some of this work for 1.6?
{quote}


Yes I think it's doable.


was (Author: jamespeach):
> [~jamespeach], do you think it's feasible to target some of this work for 1.6?

Yes I think it's doable.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389066#comment-16389066
 ] 

James Peach commented on MESOS-6918:


> [~jamespeach], do you think it's feasible to target some of this work for 1.6?

Yes I think it's doable.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.





[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195412#comment-16195412
 ] 

James Peach edited comment on MESOS-6918 at 3/7/18 5:55 AM:


Summary from our discussion:
 - retain the existing {{Timer}} value that holds the duration of the last 
sample
 - capture total duration (monotonic sum) for {{Timers}} in their time series
 - capture total sample count for {{Timers}} in their time series
 - replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or 
something)
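
A minimal sketch of the {{Timer}} changes summarized above. The {{TimerSeries}} struct and its field names are hypothetical and are not the actual libprocess {{Timer}} class; the sketch only shows the three values being tracked side by side.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; not the actual libprocess Timer.
struct TimerSeries
{
  double lastSeconds = 0.0;   // Duration of the most recent sample (existing value).
  double totalSeconds = 0.0;  // Monotonic sum of all sample durations.
  std::uint64_t count = 0;    // Monotonic number of samples recorded.

  void record(double seconds)
  {
    assert(seconds >= 0.0);  // A negative duration would break monotonicity.
    lastSeconds = seconds;
    totalSeconds += seconds;
    ++count;
  }
};
```

With the sum and the count both exposed, a consumer can derive a mean duration over any window from the ratio of the two deltas.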


was (Author: jamespeach):
Summary from our discussion:

- retain the existing {{Timer}} value that holds the duration of the last sample
- capture total duration (monotonic sum) for {{Timer}}s in their time series
- capture total sample count for {{Timer}}s in their time series
- replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or 
something)

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.





[jira] [Commented] (MESOS-8069) Role-related endpoints need to reflect hierarchical accounting.

2018-03-06 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388785#comment-16388785
 ] 

Till Toenshoff commented on MESOS-8069:
---

 !Screen Shot 2018-03-06 at 15.06.04.png! 

The second row shows a framework registered with role "a/b" that has had 
resources allocated for that role. The first row, for role "a", shows 
those resources aggregated across "a" and "a/b". 
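
The aggregation described above can be sketched as follows. The {{aggregate}} helper and the use of a single {{double}} per role are assumptions made for illustration; the real endpoints account for full resource objects rather than a scalar.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: roll each role's allocation up into all of its
// ancestors, so that "a" includes what is allocated to "a/b".
std::map<std::string, double> aggregate(
    const std::map<std::string, double>& allocated)
{
  std::map<std::string, double> result;

  for (const auto& entry : allocated) {
    const std::string& role = entry.first;
    assert(!role.empty());

    result[role] += entry.second;

    // Walk up the hierarchy: "a/b/c" -> "a/b" -> "a".
    std::string::size_type slash = role.rfind('/');
    while (slash != std::string::npos && slash > 0) {
      result[role.substr(0, slash)] += entry.second;
      slash = role.rfind('/', slash - 1);
    }
  }

  return result;
}
```

A role with no allocation of its own still appears in the result when one of its descendants has an allocation.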

> Role-related endpoints need to reflect hierarchical accounting.
> ---
>
> Key: MESOS-8069
> URL: https://issues.apache.org/jira/browse/MESOS-8069
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, HTTP API, master
>Reporter: Benjamin Mahler
>Assignee: Till Toenshoff
>Priority: Major
>  Labels: multitenancy
> Attachments: Screen Shot 2018-03-06 at 15.06.04.png
>
>
> With the introduction of hierarchical roles, the role-related endpoints need 
> to be updated to provide aggregated accounting information.
> For example, information about how many resources are allocated to "/eng" 
> should include the resources allocated to "/eng/frontend" and "/eng/backend", 
> since quota guarantees and limits are also applied on the aggregation.
> This also affects the UI display, for example the 'Roles' tab.





[jira] [Assigned] (MESOS-6128) Make "re-register" vs. "reregister" consistent in the master

2018-03-06 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach reassigned MESOS-6128:
--

Assignee: James Peach

> Make "re-register" vs. "reregister" consistent in the master
> 
>
> Key: MESOS-6128
> URL: https://issues.apache.org/jira/browse/MESOS-6128
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: James Peach
>Priority: Trivial
>  Labels: mesosphere, newbie
>
> Per discussion in https://reviews.apache.org/r/50705/, we sometimes use 
> "re-register" in comments and elsewhere we use "reregister". We should pick 
> one form and use it consistently.





[jira] [Comment Edited] (MESOS-4549) Consider returning `Try` for `os::system`.

2018-03-06 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388704#comment-16388704
 ] 

Andrew Schwartzmeyer edited comment on MESOS-4549 at 3/6/18 11:31 PM:
--

As of

commit 330ddcb51
Author: Akash Gupta akash-gu...@hotmail.com
Date:   Tue Mar 6 13:11:21 2018 -0800

Changed `os::system()` to return `Option` instead of `int`.

The `os::system()` function returned `-1` on error, which is a valid
exit code on Windows, e.g., `os::system("exit -1")`, so it was
impossible to distinguish a failure from a process returning `-1`.
With `Option`, failures will return as `None()`.

Review: https://reviews.apache.org/r/65841/

{{os::system}} now returns an {{Option}}, as {{Try}} isn't usable since it 
uses {{std::string}} for {{Error}}, which isn't async signal safe.

This can be trivially converted to a {{Try}} if we believe the underlying 
{{Error}} in the {{Try}} is safe enough.
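
To illustrate why an optional return value removes the ambiguity, here is a POSIX-only sketch. The function name {{systemOption}} is hypothetical, {{std::optional}} (C++17) stands in for stout's {{Option}}, and this is not the code from the commit above.

```cpp
#include <cassert>
#include <cerrno>
#include <optional>
#include <string>

#include <sys/wait.h>
#include <unistd.h>

// Hypothetical POSIX-only sketch; std::optional stands in for stout's Option.
std::optional<int> systemOption(const std::string& command)
{
  assert(!command.empty());

  pid_t pid = ::fork();
  if (pid == -1) {
    return std::nullopt;  // Could not create the child at all.
  }

  if (pid == 0) {
    ::execl("/bin/sh", "sh", "-c", command.c_str(),
            static_cast<char*>(nullptr));
    ::_exit(127);  // Only reached if exec itself failed.
  }

  int status = 0;
  while (::waitpid(pid, &status, 0) == -1) {
    if (errno != EINTR) {
      return std::nullopt;  // Waiting failed; there is no status to report.
    }
  }

  return status;  // Raw wait status; any int here is a real result.
}
```

A caller checks for the empty case first and only then interprets the integer, so no integer value has to be reserved as an error marker.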


was (Author: andschwa):
As of

commit 330ddcb51
Author: Akash Gupta akash-gu...@hotmail.com
Date:   Tue Mar 6 13:11:21 2018 -0800

Changed `os::system()` to return `Option` instead of `int`.

The `os::system()` function returned `-1` on error, which is a valid
exit code on Windows, e.g., `os::system("exit -1")`, so it was
impossible to distinguish a failure from a process returning `-1`.
With `Option`, failures will return as `None()`.

Review: https://reviews.apache.org/r/65841/

{{os::system}} now returns an {{Option}}, as {{Try}} isn't usable since it 
uses {{std::string}} for {{Error}}, which isn't async signal safe.

> Consider returning `Try` for `os::system`.
> --
>
> Key: MESOS-4549
> URL: https://issues.apache.org/jira/browse/MESOS-4549
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Michael Park
>Priority: Minor
>
> The {{os::system}} has the following description:
> {code}
> // Executes a command by calling "/bin/sh -c ", and returns
> // after the command has been completed. Returns 0 if succeeds, and
> // return -1 on error (e.g., fork/exec/waitpid failed). This function
> // is async signal safe. We return int instead of returning a Try
> // because Try involves 'new', which is not async signal safe.
> inline int system(const std::string& command);
> {code}
> Since {{Try}} no longer involves dynamic allocations, we can reconsider 
> returning a {{Try}} out of this function.





[jira] [Commented] (MESOS-4549) Consider returning `Try` for `os::system`.

2018-03-06 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388704#comment-16388704
 ] 

Andrew Schwartzmeyer commented on MESOS-4549:
-

As of

commit 330ddcb51
Author: Akash Gupta akash-gu...@hotmail.com
Date:   Tue Mar 6 13:11:21 2018 -0800

Changed `os::system()` to return `Option` instead of `int`.

The `os::system()` function returned `-1` on error, which is a valid
exit code on Windows, e.g., `os::system("exit -1")`, so it was
impossible to distinguish a failure from a process returning `-1`.
With `Option`, failures will return as `None()`.

Review: https://reviews.apache.org/r/65841/

{{os::system}} now returns an {{Option}}, as {{Try}} isn't usable since it 
uses {{std::string}} for {{Error}}, which isn't async signal safe.

> Consider returning `Try` for `os::system`.
> --
>
> Key: MESOS-4549
> URL: https://issues.apache.org/jira/browse/MESOS-4549
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Michael Park
>Priority: Minor
>
> The {{os::system}} has the following description:
> {code}
> // Executes a command by calling "/bin/sh -c ", and returns
> // after the command has been completed. Returns 0 if succeeds, and
> // return -1 on error (e.g., fork/exec/waitpid failed). This function
> // is async signal safe. We return int instead of returning a Try
> // because Try involves 'new', which is not async signal safe.
> inline int system(const std::string& command);
> {code}
> Since {{Try}} no longer involves dynamic allocations, we can reconsider 
> returning a {{Try}} out of this function.





[jira] [Commented] (MESOS-7342) Port Docker tests

2018-03-06 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388702#comment-16388702
 ] 

Andrew Schwartzmeyer commented on MESOS-7342:
-

commit ca357e95f
Author: Akash Gupta 
Date:   Tue Mar 6 13:11:19 2018 -0800

Windows: Fixed `WIFEXITED` and `WIFSIGNALED` stubs.

The `WIFEXITED` and `WIFSIGNALED` macros were incorrectly checking if
the exit code was not -1 to determine if the process exited or was
signaled. However, -1 is a valid return code on Windows, so when logic
like `CHECK(WIFEXITED(status) || WIFSIGNALED(status))` was used, it
would end up aborting the process accidentally.

For `WIFEXITED`, we simply return `true` because all error codes on
Windows indicate the process exited (if the process instead failed to
spawn, then `os::spawn()` would return `None()` instead of an exit
code).

For `WIFSIGNALED`, we simply return `false` for similar reasons. We
assume the process did not exit due to a signal, as that is not an
expected scenario on Windows.

Review: https://reviews.apache.org/r/65840/
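
The behavior described in the commit message can be sketched as below. The {{windows_stub}} namespace and the function names are hypothetical stand-ins, chosen so they do not clash with the POSIX macros; the actual stubs are the `WIFEXITED` and `WIFSIGNALED` macros themselves.

```cpp
#include <cassert>

// Hypothetical stand-ins for the Windows stubs described in the commit
// message; named differently to avoid clashing with the POSIX macros.
namespace windows_stub {

// Every exit code on Windows, including -1, means the process exited.
constexpr bool wifexited(int /* status */) { return true; }

// Termination by signal is not an expected scenario on Windows.
constexpr bool wifsignaled(int /* status */) { return false; }

} // namespace windows_stub

// The kind of check quoted in the commit message; with the stubs above
// it holds for any status, including -1.
inline void checkStatus(int status)
{
  assert(windows_stub::wifexited(status) ||
         windows_stub::wifsignaled(status));
}
```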

> Port Docker tests
> -
>
> Key: MESOS-7342
> URL: https://issues.apache.org/jira/browse/MESOS-7342
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrew Schwartzmeyer
>Assignee: Akash Gupta
>Priority: Major
>  Labels: docker, windows
>
> While one of Daniel Pravat's last acts was introducing the Docker 
> containerizer for Windows, we don't have tests. We need to port 
> `docker_tests.cpp` and `docker_containerizer_tests.cpp` to Windows.





[jira] [Assigned] (MESOS-8643) `os::system` and `os::spawn` returns -1 on valid windows commands

2018-03-06 Thread Akash Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akash Gupta reassigned MESOS-8643:
--

Assignee: Akash Gupta

> `os::system` and `os::spawn` returns -1 on valid windows commands
> -
>
> Key: MESOS-8643
> URL: https://issues.apache.org/jira/browse/MESOS-8643
> Project: Mesos
>  Issue Type: Bug
>Reporter: Akash Gupta
>Assignee: Akash Gupta
>Priority: Major
>
> `os::system` and `os::spawn` return the process exit code or -1 on failure. 
> However, on Windows, -1 is a valid exit code (e.g. `os::system("exit -1")`). 
> It's impossible to distinguish a failure from a process returning -1, so 
> those calls need to return something like a `Try` or `Option` to 
> distinguish the error case.





[jira] [Assigned] (MESOS-8644) W* macros wrong on Windows.

2018-03-06 Thread Akash Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akash Gupta reassigned MESOS-8644:
--

Assignee: Akash Gupta

> W* macros wrong on Windows.
> ---
>
> Key: MESOS-8644
> URL: https://issues.apache.org/jira/browse/MESOS-8644
> Project: Mesos
>  Issue Type: Bug
>Reporter: Akash Gupta
>Assignee: Akash Gupta
>Priority: Major
>
> The `WIFEXITED` macro checks whether the return code is -1 to determine if the 
> process has exited, but on Windows a process can legitimately return -1 as an 
> exit code. This is especially an issue because parts of the Mesos code base use 
> `CHECK(WIFEXITED(exit_code) ... )`, which will abort the process if 
> the exit_code is -1.
>  
> Furthermore, the other W* macros determine signal handling, which doesn't 
> make any sense on Windows and can be misused. 





[jira] [Created] (MESOS-8644) W* macros wrong on Windows.

2018-03-06 Thread Akash Gupta (JIRA)
Akash Gupta created MESOS-8644:
--

 Summary: W* macros wrong on Windows.
 Key: MESOS-8644
 URL: https://issues.apache.org/jira/browse/MESOS-8644
 Project: Mesos
  Issue Type: Bug
Reporter: Akash Gupta


The `WIFEXITED` macro checks whether the return code is -1 to determine if the 
process has exited, but on Windows a process can legitimately return -1 as an 
exit code. This is especially an issue because parts of the Mesos code base use 
`CHECK(WIFEXITED(exit_code) ... )`, which will abort the process if the 
exit_code is -1.

 

Furthermore, the other W* macros determine signal handling, which doesn't make 
any sense on Windows and can be misused. 





[jira] [Created] (MESOS-8643) `os::system` and `os::spawn` returns -1 on valid windows commands

2018-03-06 Thread Akash Gupta (JIRA)
Akash Gupta created MESOS-8643:
--

 Summary: `os::system` and `os::spawn` returns -1 on valid windows 
commands
 Key: MESOS-8643
 URL: https://issues.apache.org/jira/browse/MESOS-8643
 Project: Mesos
  Issue Type: Bug
Reporter: Akash Gupta


`os::system` and `os::spawn` return the process exit code or -1 on failure. 
However, on Windows, -1 is a valid exit code (e.g. `os::system("exit -1")`). 
It's impossible to distinguish a failure from a process returning -1, so those 
calls need to return something like a `Try` or `Option` to 
distinguish the error case.





[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388599#comment-16388599
 ] 

Zhitao Li edited comment on MESOS-6918 at 3/6/18 10:07 PM:
---

[~jamespeach], do you think it's feasible to target some of this work for 1.6? 
We are interested in using this format for our monitoring on master/agent.

The issue we have is that we need to hardcode whether a metric is gauge or 
counter because our monitoring system treats them differently, and that hard 
coded list was never maintainable.
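
For context, the Prometheus text exposition format carries the metric type in a "# TYPE" line, which is what would make a hard-coded gauge/counter list unnecessary. A minimal sketch, assuming a hypothetical {{expose}} helper and a made-up metric name; this is not the Mesos implementation:

```cpp
#include <cassert>
#include <sstream>
#include <string>

enum class MetricType { COUNTER, GAUGE };

// Hypothetical formatter: emits one sample in the Prometheus text
// exposition format, preceded by its "# TYPE" line.
std::string expose(const std::string& name, MetricType type, double value)
{
  assert(!name.empty());

  std::ostringstream out;
  out << "# TYPE " << name << " "
      << (type == MetricType::COUNTER ? "counter" : "gauge") << "\n"
      << name << " " << value << "\n";
  return out.str();
}
```

A consumer can then read the type from the scrape output instead of maintaining its own mapping.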


was (Author: zhitao):
[~jamespeach], do you think it's feasible to target some of this work for 1.6? 
We are interested in reusing some functionalities here.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.





[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388599#comment-16388599
 ] 

Zhitao Li commented on MESOS-6918:
--

[~jamespeach], do you think it's feasible to target some of this work for 1.6? 
We are interested in reusing some functionalities here.

> Prometheus exporter endpoints for metrics
> -
>
> Key: MESOS-6918
> URL: https://issues.apache.org/jira/browse/MESOS-6918
> Project: Mesos
>  Issue Type: Bug
>  Components: statistics
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
>
> There are a couple of [Prometheus|https://prometheus.io] metrics exporters 
> for Mesos, of varying quality. Since the Mesos stats system actually knows 
> about statistics data types and semantics, and Mesos has reasonable HTTP 
> support we could add Prometheus metrics endpoints to directly expose 
> statistics in [Prometheus wire 
> format|https://prometheus.io/docs/instrumenting/exposition_formats/], 
> removing the need for operators to run separate exporter processes.





[jira] [Comment Edited] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388552#comment-16388552
 ] 

Zhitao Li edited comment on MESOS-4965 at 3/6/18 9:24 PM:
--

WIP [design 
doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#]
 (mostly gathering information)


was (Author: zhitao):
WIP[ design 
doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#]
 (mostly gather information)

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).





[jira] [Commented] (MESOS-4965) Support resizing of an existing persistent volume

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388552#comment-16388552
 ] 

Zhitao Li commented on MESOS-4965:
--

WIP[ design 
doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#]
 (mostly gather information)

> Support resizing of an existing persistent volume
> -
>
> Key: MESOS-4965
> URL: https://issues.apache.org/jira/browse/MESOS-4965
> Project: Mesos
>  Issue Type: Improvement
>  Components: storage
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>  Labels: mesosphere, persistent-volumes, storage
>
> We need a mechanism to update the size of a persistent volume.
> The increase case is generally more interesting to us (as long as there is 
> still available disk resource on the same disk).





[jira] [Commented] (MESOS-8530) Default executor tasks can get stuck in KILLING state

2018-03-06 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-8530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388262#comment-16388262
 ] 

Gastón Kleiman commented on MESOS-8530:
---

[~kaysoky] hey, do you think you'll have time to review the chain this week?

> Default executor tasks can get stuck in KILLING state
> -
>
> Key: MESOS-8530
> URL: https://issues.apache.org/jira/browse/MESOS-8530
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.2.3, 1.3.1, 1.4.1, 1.5.0
>Reporter: Gastón Kleiman
>Assignee: Gastón Kleiman
>Priority: Critical
>  Labels: default-executor, mesosphere
>
> The default executor will transition a task to {{TASK_KILLING}} and mark its 
> container as being killed before issuing the {{KILL_NESTED_CONTAINER}} call.
> If the kill call fails, the task will get stuck in {{TASK_KILLING}}, and the 
> executor won't allow retrying the kill.





[jira] [Commented] (MESOS-8641) New heartbeat on event stream could change the behavior for subscriber

2018-03-06 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388094#comment-16388094
 ] 

Zhitao Li commented on MESOS-8641:
--

Attempt to fix: 

https://reviews.apache.org/r/65930

> New heartbeat on event stream could change the behavior for subscriber
> --
>
> Key: MESOS-8641
> URL: https://issues.apache.org/jira/browse/MESOS-8641
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.5.0
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>Priority: Major
>
> A new event for heartbeat is added in 
> [MESOS-7695|https://reviews.apache.org/r/61262/bugs/MESOS-7695/], but I 
> believe the implementation in [https://reviews.apache.org/r/61262/] can 
> trigger a corner case and send *_HEARTBEAT_* before _*SUBSCRIBED*_
>  
> I would consider this a behavior change for the customer and I propose we 
> change the order as I suggest in the review to preserve previous behavior 
> (since the subscriber needs to see the _*SUBSCRIBED*_ event to really know 
> how it should respond to *_HEARTBEAT_* message anyway)





[jira] [Commented] (MESOS-8382) Master should bookkeep local resource providers.

2018-03-06 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388055#comment-16388055
 ] 

Benjamin Bannier commented on MESOS-8382:
-

{noformat}
commit 0d247c3887ea08b6273992218cd5899010d89fed
Author: Benjamin Bannier 
Date: Tue Mar 6 16:02:00 2018 +0100

Used proto UUID instead stout UUID internally for operation IDs.

Review: https://reviews.apache.org/r/65588/

commit 4c4ee4575667e721f710cbf5a09ba3ec94001672
Author: Benjamin Bannier 
Date: Tue Mar 6 16:01:55 2018 +0100

Added hash function for mesos::UUID.

Review: https://reviews.apache.org/r/65587/
{noformat}

> Master should bookkeep local resource providers.
> 
>
> Key: MESOS-8382
> URL: https://issues.apache.org/jira/browse/MESOS-8382
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: mesosphere, storage
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> This will simplify the handling of `UpdateSlaveMessage`. Also, it'll simplify 
> the endpoint serving.





[jira] [Commented] (MESOS-8124) PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky.

2018-03-06 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388048#comment-16388048
 ] 

Benjamin Bannier commented on MESOS-8124:
-

Another failure from parallel test execution:
{noformat}
1 out of 1903 tests failed

[ RUN ] PosixRLimitsIsolatorTest.TaskExceedingLimit
../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:342: Failure
Expected: TASK_STARTING
To be equal to: statusStarting->state()
Which is: TASK_FAILED
Ready
../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:344: Failure
Failed to wait 15secs for statusRunning
../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:333: Failure
Actual function call count doesn't match EXPECT_CALL(sched, 
statusUpdate(&driver, _))...
Expected: to be called 3 times
Actual: called once - unsatisfied and active
[ FAILED ] PosixRLimitsIsolatorTest.TaskExceedingLimit (16986 ms)
{noformat}
It might make sense to use a limit that is not time-based to reduce this kind 
of flakiness.

> PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky.
> -
>
> Key: MESOS-8124
> URL: https://issues.apache.org/jira/browse/MESOS-8124
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Mahler
>Priority: Major
>  Labels: flaky-test
> Attachments: failed.txt, success.txt
>
>
> This test is flaky on CI:
> {noformat}
> ../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:348: Failure
> Failed to wait 15secs for statusFailed
> ../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:333: Failure
> Actual function call count doesn't match EXPECT_CALL(sched, 
> statusUpdate(&driver, _))...
>  Expected: to be called 3 times
>Actual: called twice - unsatisfied and active
> {noformat}





[jira] [Created] (MESOS-8642) balloon-executor is hard to run as unprivileged user

2018-03-06 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8642:
---

 Summary: balloon-executor is hard to run as unprivileged user
 Key: MESOS-8642
 URL: https://issues.apache.org/jira/browse/MESOS-8642
 Project: Mesos
  Issue Type: Bug
 Environment: The {{balloon-executor}} currently requires the ability 
to {{mlock}} large amounts of memory in order to prevent swapping. Since the 
amount of memory users can {{mlock}} is controlled by an rlimit, this can make 
it harder than needed to run this executor as an unprivileged user.

It should at least be possible to drop the {{mlock}}'ing completely if the host 
system uses no swap.
Reporter: Benjamin Bannier
Assignee: Benjamin Bannier








[jira] [Commented] (MESOS-8593) Support credential updates in Docker config without restarting the agent

2018-03-06 Thread Kshitiz Bakshi (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387683#comment-16387683
 ] 

Kshitiz Bakshi commented on MESOS-8593:
---

Hi, 
We're users of DC/OS Community Edition, so our interface to Mesos is 
Marathon. DC/OS Community Edition does not support using different image pull 
secrets for each task. 

In the Docker containerizer case, everything already works by changing the 
creds file, because Docker reads credentials on every image pull. We have our 
own tooling to refresh the credentials. Migrating to UCR is blocked for us by 
this issue, as mesos-agent does not read the passed file on each image pull.

> Support credential updates in Docker config without restarting the agent
> 
>
> Key: MESOS-8593
> URL: https://issues.apache.org/jira/browse/MESOS-8593
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Reporter: Jan Schlicht
>Priority: Major
>
> When using the Mesos containerizer with a private Docker repository with 
> {{--docker_config}} option, the repository might expire credentials after 
> some time, forcing the user to log in again. In that case the Docker config in 
> use will change and the agent needs to be restarted to reflect the change. 
> Instead of restarting, the agent could reload the Docker config file every 
> time before fetching.
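
The proposal above amounts to reading the config at fetch time instead of once at agent startup. A minimal sketch, assuming a hypothetical {{loadDockerConfig}} helper that a fetch path would call on every pull; it uses {{std::optional}} (C++17) and is not the agent's actual code:

```cpp
#include <cassert>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: read the Docker config from disk at fetch time
// instead of caching its contents at agent startup.
std::optional<std::string> loadDockerConfig(const std::string& path)
{
  assert(!path.empty());

  std::ifstream file(path);
  if (!file) {
    return std::nullopt;  // Missing or unreadable file.
  }

  std::ostringstream contents;
  contents << file.rdbuf();
  return contents.str();
}
```

Because nothing is cached between calls, credentials rewritten by external tooling are picked up by the next pull without a restart.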





[jira] [Commented] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.

2018-03-06 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387460#comment-16387460
 ] 

Qian Zhang commented on MESOS-8497:
---

RR: https://reviews.apache.org/r/65918/

> Docker parameter `name` does not work with Docker Containerizer.
> 
>
> Key: MESOS-8497
> URL: https://issues.apache.org/jira/browse/MESOS-8497
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Jörg Schad
>Assignee: Qian Zhang
>Priority: Critical
>  Labels: containerizer
> Attachments: agent.log, master.log
>
>
> When deploying a marathon app with Docker Containerizer (need to check Mesos 
> Containerizer) and the parameter name set, Mesos is not able to 
> recognize/control/kill the started container.
> Steps to reproduce 
>  # Deploy the below marathon app definition
>  #  Watch task being stuck in staging and mesos not being able to kill 
> it/communicate with it
>  ## 
> {quote}e.g., Agent Logs: W0126 18:38:50.00  4988 slave.cpp:6750] Failed 
> to get resource statistics for executor 
> ‘instana-agent.1a1f8d22-02c8-11e8-b607-923c3c523109’ of framework 
> 41f1b534-5f9d-4b5e-bb74-a0e387d5739f-0001: Failed to run ‘docker -H 
> unix:///var/run/docker.sock inspect 
> mesos-1c6f894d-9a3e-408c-8146-47ebab2f28be’: exited with status 1; 
> stderr=’Error: No such image, container or task: 
> mesos-1c6f894d-9a3e-408c-8146-47ebab2f28be{quote}
>  # Check on node and see container running, but not being recognized by mesos
> {noformat}
> {
> "id": "/docker-test",
> "instances": 1,
> "portDefinitions": [],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "ubuntu:16.04",
> "parameters": [
> {
> "key": "name",
> "value": "myname"
> }
> ]
> }
> },
> "cpus": 0.1,
> "mem": 128,
> "requirePorts": false,
> "networks": [],
> "healthChecks": [],
> "fetch": [],
> "constraints": [],
> "cmd": "sleep 1000"
> }
> {noformat}


