[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389066#comment-16389066 ] James Peach edited comment on MESOS-6918 at 3/7/18 6:01 AM: {quote} [~jamespeach], do you think it's feasible to target some of this work for 1.6? {quote} Yes I think it's doable. was (Author: jamespeach): > [~jamespeach], do you think it's feasible to target some of this work for 1.6? Yes I think it's doable. > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389066#comment-16389066 ] James Peach commented on MESOS-6918: > [~jamespeach], do you think it's feasible to target some of this work for 1.6? Yes I think it's doable. > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16195412#comment-16195412 ] James Peach edited comment on MESOS-6918 at 3/7/18 5:55 AM: Summary from our discussion: - retain the existing {{Timer}} value that holds the duration of the last sample - capture total duration (monotonic sum) for {{Timers}} in their time series - capture total sample count for {{Timers}} in their time series - replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or something) was (Author: jamespeach): Summary from our discussion: - retain the existing {{Timer}} value that holds the duration of the last sample - capture total duration (monotonic sum) for {{Timer}}s in their time series - capture total sample count for {{Timer}}s in their time series - replace the {{Semantics}} enum with a {{monotonic}} marker (enum or bool or something) > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
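The Timer summary in the comment above can be sketched in a few lines of C++. This is only an illustration of the three values being discussed (last sample, monotonic total duration, monotonic sample count). `TimerSketch` and its members are hypothetical names, not the libprocess `Timer` class or its API.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a metrics timer; not the libprocess `Timer`.
struct TimerSketch
{
  std::chrono::nanoseconds last{0};   // Duration of the most recent sample.
  std::chrono::nanoseconds total{0};  // Monotonic sum of all sample durations.
  std::uint64_t count = 0;            // Monotonic number of recorded samples.

  // Records one sample: overwrites `last`, and only ever grows
  // `total` and `count`.
  void record(std::chrono::nanoseconds sample)
  {
    last = sample;
    total += sample;
    ++count;
  }
};
```

Because `total` and `count` never decrease, a consumer could treat both as counters and derive an average duration over any interval, while `last` keeps the existing single-sample behaviour.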
[jira] [Commented] (MESOS-8069) Role-related endpoints need to reflect hierarchical accounting.
[ https://issues.apache.org/jira/browse/MESOS-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388785#comment-16388785 ] Till Toenshoff commented on MESOS-8069: --- !Screen Shot 2018-03-06 at 15.06.04.png! In the second row, we see a framework is registered with role "a/b" and has gotten some resources allocated for that role. The first row, role "a" shows those resources aggregated for "a" and "a/b". > Role-related endpoints need to reflect hierarchical accounting. > --- > > Key: MESOS-8069 > URL: https://issues.apache.org/jira/browse/MESOS-8069 > Project: Mesos > Issue Type: Bug > Components: agent, HTTP API, master >Reporter: Benjamin Mahler >Assignee: Till Toenshoff >Priority: Major > Labels: multitenancy > Attachments: Screen Shot 2018-03-06 at 15.06.04.png > > > With the introduction of hierarchical roles, the role-related endpoints need > to be updated to provide aggregated accounting information. > For example, information about how many resources are allocated to "/eng" > should include the resources allocated to "/eng/frontend" and "/eng/backend", > since quota guarantees and limits are also applied on the aggregation. > This also affects the UI display, for example the 'Roles' tab. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
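The aggregation described above (role "a" showing the resources of "a" and "a/b") could be sketched as follows. This is a simplified illustration, not the master's accounting code: `aggregateByRole` is a hypothetical helper, a single `double` stands in for a full resources object, and role names are assumed to have no leading slash, as in the "a/b" example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Adds each role's own allocation to the role itself and to every
// ancestor, so "a" ends up holding the sum for "a" and "a/b".
std::map<std::string, double> aggregateByRole(
    const std::map<std::string, double>& allocated)
{
  std::map<std::string, double> totals;
  for (const auto& entry : allocated) {
    std::string role = entry.first;
    while (true) {
      totals[role] += entry.second;
      const std::size_t slash = role.rfind('/');
      if (slash == std::string::npos) {
        break;  // Reached a top-level role.
      }
      role.erase(slash);  // Step up to the parent role.
    }
  }
  return totals;
}
```

Calling it with allocations for "a" and "a/b" yields a total for "a" that includes both, while "a/b" keeps only its own share.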
[jira] [Assigned] (MESOS-6128) Make "re-register" vs. "reregister" consistent in the master
[ https://issues.apache.org/jira/browse/MESOS-6128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Peach reassigned MESOS-6128: -- Assignee: James Peach > Make "re-register" vs. "reregister" consistent in the master > > > Key: MESOS-6128 > URL: https://issues.apache.org/jira/browse/MESOS-6128 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Neil Conway >Assignee: James Peach >Priority: Trivial > Labels: mesosphere, newbie > > Per discussion in https://reviews.apache.org/r/50705/, we sometimes use > "re-register" in comments and elsewhere we use "reregister". We should pick > one form and use it consistently. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-4549) Consider returning `Try` for `os::system`.
[ https://issues.apache.org/jira/browse/MESOS-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388704#comment-16388704 ] Andrew Schwartzmeyer edited comment on MESOS-4549 at 3/6/18 11:31 PM: -- As of commit 330ddcb51 Author: Akash Gupta akash-gu...@hotmail.com Date: Tue Mar 6 13:11:21 2018 -0800 Changed `os::system()` to return `Option` instead of `int`. The `os::system()` function returned `-1` on error, which is a valid exit code on Windows, e.g., `os::system("exit -1")`, so it was impossible to distinguish a failure from a process returning `-1`. With `Option`, failures will return as `None()`. Review: https://reviews.apache.org/r/65841/ {{os::system}} now returns an {{Option}}, as {{Try}} isn't usable since it uses {{std::string}} for {{Error}}, which isn't async signal safe. This can be trivially converted to a {{Try}} if we believe the underlying {{Error}} in the {{Try}} is safe enough. was (Author: andschwa): As of commit 330ddcb51 Author: Akash Gupta akash-gu...@hotmail.com Date: Tue Mar 6 13:11:21 2018 -0800 Changed `os::system()` to return `Option` instead of `int`. The `os::system()` function returned `-1` on error, which is a valid exit code on Windows, e.g., `os::system("exit -1")`, so it was impossible to distinguish a failure from a process returning `-1`. With `Option`, failures will return as `None()`. Review: https://reviews.apache.org/r/65841/ {{os::system}} now returns an {{Option}}, as {{Try}} isn't usable since it uses {{std::string}} for {{Error}}, which isn't async signal safe. > Consider returning `Try` for `os::system`. > -- > > Key: MESOS-4549 > URL: https://issues.apache.org/jira/browse/MESOS-4549 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Michael Park >Priority: Minor > > The {{os::system}} has the following description: > {code} > // Executes a command by calling "/bin/sh -c ", and returns > // after the command has been completed. Returns 0 if succeeds, and > // return -1 on error (e.g., fork/exec/waitpid failed). This function > // is async signal safe. We return int instead of returning a Try > // because Try involves 'new', which is not async signal safe. > inline int system(const std::string& command); > {code} > Since {{Try}} no longer involves dynamic allocations, we can reconsider > returning a {{Try}} out of this function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-4549) Consider returning `Try` for `os::system`.
[ https://issues.apache.org/jira/browse/MESOS-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388704#comment-16388704 ] Andrew Schwartzmeyer commented on MESOS-4549: - As of commit 330ddcb51 Author: Akash Gupta akash-gu...@hotmail.com Date: Tue Mar 6 13:11:21 2018 -0800 Changed `os::system()` to return `Option` instead of `int`. The `os::system()` function returned `-1` on error, which is a valid exit code on Windows, e.g., `os::system("exit -1")`, so it was impossible to distinguish a failure from a process returning `-1`. With `Option`, failures will return as `None()`. Review: https://reviews.apache.org/r/65841/ {{os::system}} now returns an {{Option}}, as {{Try}} isn't usable since it uses {{std::string}} for {{Error}}, which isn't async signal safe. > Consider returning `Try` for `os::system`. > -- > > Key: MESOS-4549 > URL: https://issues.apache.org/jira/browse/MESOS-4549 > Project: Mesos > Issue Type: Task > Components: stout >Reporter: Michael Park >Priority: Minor > > The {{os::system}} has the following description: > {code} > // Executes a command by calling "/bin/sh -c ", and returns > // after the command has been completed. Returns 0 if succeeds, and > // return -1 on error (e.g., fork/exec/waitpid failed). This function > // is async signal safe. We return int instead of returning a Try > // because Try involves 'new', which is not async signal safe. > inline int system(const std::string& command); > {code} > Since {{Try}} no longer involves dynamic allocations, we can reconsider > returning a {{Try}} out of this function. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
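The ambiguity that motivated the change above can be shown with a small sketch. It does not reproduce stout's code: `legacyResult` and `optionResult` are hypothetical helpers, and `std::optional` (C++17) stands in for stout's `Option`, with `std::nullopt` playing the role of `None()`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical pre-change shape: -1 doubles as "failed to run the
// command" and as a genuine exit code.
int legacyResult(bool spawned, int exitCode)
{
  return spawned ? exitCode : -1;
}

// Hypothetical post-change shape: failure is a distinct empty value,
// so every integer remains available as an exit code.
std::optional<int> optionResult(bool spawned, int exitCode)
{
  if (!spawned) {
    return std::nullopt;  // Corresponds to `None()` in the commit message.
  }
  return exitCode;
}
```

With the legacy shape, a failed spawn and a process that exits with -1 produce the same value. With the optional shape, a caller checks `has_value()` first and only then reads the exit code.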
[jira] [Commented] (MESOS-7342) Port Docker tests
[ https://issues.apache.org/jira/browse/MESOS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388702#comment-16388702 ] Andrew Schwartzmeyer commented on MESOS-7342: - commit ca357e95f Author: Akash Gupta Date: Tue Mar 6 13:11:19 2018 -0800 Windows: Fixed `WIFEXITED` and `WIFSIGNALED` stubs. The `WIFEXITED` and `WIFSIGNALED` macros were incorrectly checking if the exit code was not -1 to determine if the process exited or was signaled. However, -1 is a valid return code on Windows, so when logic like `CHECK(WIFEXITED(status) || WIFSIGNALED(status))` was used, it would end up aborting the process accidentally. For `WIFEXITED`, we simply return `true` because all error codes on Windows indicate the process exited (if the process instead failed to spawn, then `os::spawn()` would return `None()` instead of an exit code). For `WIFSIGNALED`, we simply return `false` for similar reasons. We assume the process did not exit due to a signal, as that is not an expected scenario on Windows. Review: https://reviews.apache.org/r/65840/ > Port Docker tests > - > > Key: MESOS-7342 > URL: https://issues.apache.org/jira/browse/MESOS-7342 > Project: Mesos > Issue Type: Bug >Reporter: Andrew Schwartzmeyer >Assignee: Akash Gupta >Priority: Major > Labels: docker, windows > > While one of Daniel Pravat's last acts was introducing the Docker > containerizer for Windows, we don't have tests. We need to port > `docker_tests.cpp` and `docker_containerizer_tests.cpp` to Windows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
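The behaviour the commit message describes for the Windows stubs might look roughly like the sketch below. The macro names are hypothetical (prefixed with `SKETCH_` to avoid clashing with the real POSIX macros), and the actual definitions in stout's Windows headers may be written differently.

```cpp
#include <cassert>

// Hypothetical names; the real stubs live in stout's Windows headers.
// Any status that reaches the caller came from a process that exited,
// because a failed spawn is reported separately (as `None()`).
#define SKETCH_WIFEXITED(status) (static_cast<void>(status), true)

// Signals are not an expected cause of process exit on Windows.
#define SKETCH_WIFSIGNALED(status) (static_cast<void>(status), false)
```

Under these definitions a check of the form `CHECK(WIFEXITED(status) || WIFSIGNALED(status))` holds for every status, including -1, which is the case that previously aborted the process.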
[jira] [Assigned] (MESOS-8643) `os::system` and `os::spawn` returns -1 on valid windows commands
[ https://issues.apache.org/jira/browse/MESOS-8643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash Gupta reassigned MESOS-8643: -- Assignee: Akash Gupta > `os::system` and `os::spawn` returns -1 on valid windows commands > - > > Key: MESOS-8643 > URL: https://issues.apache.org/jira/browse/MESOS-8643 > Project: Mesos > Issue Type: Bug >Reporter: Akash Gupta >Assignee: Akash Gupta >Priority: Major > > `os::system` and `os::spawn` return the process exit code or -1 on failure. > However, on Windows, -1 is a valid exit code (e.g. `os::system("exit -1")`). > It's impossible to distinguish a failure from a process returning -1, so > those calls need to return something like a `Try` or `Option` to > distinguish the error case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8644) W* macros wrong on Windows.
[ https://issues.apache.org/jira/browse/MESOS-8644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akash Gupta reassigned MESOS-8644: -- Assignee: Akash Gupta > W* macros wrong on Windows. > --- > > Key: MESOS-8644 > URL: https://issues.apache.org/jira/browse/MESOS-8644 > Project: Mesos > Issue Type: Bug >Reporter: Akash Gupta >Assignee: Akash Gupta >Priority: Major > > The `WIFEXITED` macro checks if the return code is -1 to determine if the process > has exited, but on Windows a process can legitimately return -1 as an exit > code. It's especially an issue because parts of the Mesos code base use > `CHECK(WIFEXITED(exit_code) ... )`, which will throw an assertion error if > the exit_code is -1. > > Furthermore, the other W* macros determine signal handling, which doesn't > make any sense on Windows and can be misused. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8644) W* macros wrong on Windows.
Akash Gupta created MESOS-8644: -- Summary: W* macros wrong on Windows. Key: MESOS-8644 URL: https://issues.apache.org/jira/browse/MESOS-8644 Project: Mesos Issue Type: Bug Reporter: Akash Gupta The `WIFEXITED` macro checks if the return code is -1 to determine if the process has exited, but on Windows a process can legitimately return -1 as an exit code. It's especially an issue because parts of the Mesos code base use `CHECK(WIFEXITED(exit_code) ... )`, which will throw an assertion error if the exit_code is -1. Furthermore, the other W* macros determine signal handling, which doesn't make any sense on Windows and can be misused. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8643) `os::system` and `os::spawn` returns -1 on valid windows commands
Akash Gupta created MESOS-8643: -- Summary: `os::system` and `os::spawn` returns -1 on valid windows commands Key: MESOS-8643 URL: https://issues.apache.org/jira/browse/MESOS-8643 Project: Mesos Issue Type: Bug Reporter: Akash Gupta `os::system` and `os::spawn` return the process exit code or -1 on failure. However, on Windows, -1 is a valid exit code (e.g. `os::system("exit -1")`). It's impossible to distinguish a failure from a process returning -1, so those calls need to return something like a `Try` or `Option` to distinguish the error case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388599#comment-16388599 ] Zhitao Li edited comment on MESOS-6918 at 3/6/18 10:07 PM: --- [~jamespeach], do you think it's feasible to target some of this work for 1.6? We are interested in using this format for our monitoring on master/agent. The issue we have is that we need to hardcode whether a metric is a gauge or a counter because our monitoring system treats them differently, and that hardcoded list was never maintainable. was (Author: zhitao): [~jamespeach], do you think it's feasible to target some of this work for 1.6? We are interested in reusing some functionalities here. > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
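The gauge/counter distinction mentioned in the comment above is what the Prometheus text exposition format carries in its `# TYPE` line, so a consumer would not need a hardcoded list. A minimal sketch of rendering one sample follows. `MetricKind` and `renderMetric` are hypothetical names, not Mesos types, and the metric name is assumed to be already valid for Prometheus (for example, with any `/` in a Mesos metric key replaced beforehand).

```cpp
#include <cassert>
#include <sstream>
#include <string>

enum class MetricKind { Counter, Gauge };  // Hypothetical; not a Mesos type.

// Renders one metric as a "# TYPE" line followed by a sample line,
// following the Prometheus text exposition format.
std::string renderMetric(
    const std::string& name, MetricKind kind, double value)
{
  std::ostringstream out;
  out << "# TYPE " << name << ' '
      << (kind == MetricKind::Counter ? "counter" : "gauge") << '\n'
      << name << ' ' << value << '\n';
  return out.str();
}
```

For a gauge named `master_tasks_running` (an illustrative name) with value 3, this produces a `# TYPE master_tasks_running gauge` line followed by `master_tasks_running 3`.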
[jira] [Commented] (MESOS-6918) Prometheus exporter endpoints for metrics
[ https://issues.apache.org/jira/browse/MESOS-6918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388599#comment-16388599 ] Zhitao Li commented on MESOS-6918: -- [~jamespeach], do you think it's feasible to target some of this work for 1.6? We are interested in reusing some functionalities here. > Prometheus exporter endpoints for metrics > - > > Key: MESOS-6918 > URL: https://issues.apache.org/jira/browse/MESOS-6918 > Project: Mesos > Issue Type: Bug > Components: statistics >Reporter: James Peach >Assignee: James Peach >Priority: Major > > There are a couple of [Prometheus|https://prometheus.io] metrics exporters > for Mesos, of varying quality. Since the Mesos stats system actually knows > about statistics data types and semantics, and Mesos has reasonable HTTP > support we could add Prometheus metrics endpoints to directly expose > statistics in [Prometheus wire > format|https://prometheus.io/docs/instrumenting/exposition_formats/], > removing the need for operators to run separate exporter processes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-4965) Support resizing of an existing persistent volume
[ https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388552#comment-16388552 ] Zhitao Li edited comment on MESOS-4965 at 3/6/18 9:24 PM: -- WIP [design doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#] (mostly gather information) was (Author: zhitao): WIP[ design doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#] (mostly gather information) > Support resizing of an existing persistent volume > - > > Key: MESOS-4965 > URL: https://issues.apache.org/jira/browse/MESOS-4965 > Project: Mesos > Issue Type: Improvement > Components: storage >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Major > Labels: mesosphere, persistent-volumes, storage > > We need a mechanism to update the size of a persistent volume. > The increase case is generally more interesting to us (as long as there is still > available disk resource on the same disk). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-4965) Support resizing of an existing persistent volume
[ https://issues.apache.org/jira/browse/MESOS-4965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388552#comment-16388552 ] Zhitao Li commented on MESOS-4965: -- WIP[ design doc|https://docs.google.com/document/d/1Z16okNG8mlf2eA6NyW_PUmBfNFs_6EOaPzPtwYNVQUQ/edit#] (mostly gather information) > Support resizing of an existing persistent volume > - > > Key: MESOS-4965 > URL: https://issues.apache.org/jira/browse/MESOS-4965 > Project: Mesos > Issue Type: Improvement > Components: storage >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Major > Labels: mesosphere, persistent-volumes, storage > > We need a mechanism to update the size of a persistent volume. > The increase case is generally more interesting to us (as long as there is still > available disk resource on the same disk). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8530) Default executor tasks can get stuck in KILLING state
[ https://issues.apache.org/jira/browse/MESOS-8530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388262#comment-16388262 ] Gastón Kleiman commented on MESOS-8530: --- [~kaysoky] hey, do you think you'll have time to review the chain this week? > Default executor tasks can get stuck in KILLING state > - > > Key: MESOS-8530 > URL: https://issues.apache.org/jira/browse/MESOS-8530 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.2.3, 1.3.1, 1.4.1, 1.5.0 >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman >Priority: Critical > Labels: default-executor, mesosphere > > The default executor will transition a task to {{TASK_KILLING}} and mark its > container as being killed before issuing the {{KILL_NESTED_CONTAINER}} call. > If the kill call fails, the task will get stuck in {{TASK_KILLING}}, and the > executor won't allow retrying the kill. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8641) New heartbeat on event stream could change the behavior for subscriber
[ https://issues.apache.org/jira/browse/MESOS-8641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388094#comment-16388094 ] Zhitao Li commented on MESOS-8641: -- Attempt to fix: https://reviews.apache.org/r/65930 > New heartbeat on event stream could change the behavior for subscriber > -- > > Key: MESOS-8641 > URL: https://issues.apache.org/jira/browse/MESOS-8641 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 1.5.0 >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Major > > A new event for heartbeat is added in > [MESOS-7695|https://reviews.apache.org/r/61262/bugs/MESOS-7695/], but I > believe the implementation in [https://reviews.apache.org/r/61262/] can > trigger a corner case and send *_HEARTBEAT_* before _*SUBSCRIBED*_ > > I would consider this a behavior change for the customer and I propose we > change the order as I suggest in the review to preserve previous behavior > (since the subscriber needs to see the _*SUBSCRIBED*_ event to really know > how it should respond to *_HEARTBEAT_* message anyway) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8382) Master should bookkeep local resource providers.
[ https://issues.apache.org/jira/browse/MESOS-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388055#comment-16388055 ] Benjamin Bannier commented on MESOS-8382: - {noformat} commit 0d247c3887ea08b6273992218cd5899010d89fed Author: Benjamin Bannier Date: Tue Mar 6 16:02:00 2018 +0100 Used proto UUID instead stout UUID internally for operation IDs. Review: https://reviews.apache.org/r/65588/ commit 4c4ee4575667e721f710cbf5a09ba3ec94001672 Author: Benjamin Bannier Date: Tue Mar 6 16:01:55 2018 +0100 Added hash function for mesos::UUID. Review: https://reviews.apache.org/r/65587/ {noformat} > Master should bookkeep local resource providers. > > > Key: MESOS-8382 > URL: https://issues.apache.org/jira/browse/MESOS-8382 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Benjamin Bannier >Priority: Major > Labels: mesosphere, storage > Original Estimate: 5m > Remaining Estimate: 5m > > This will simplify the handling of `UpdateSlaveMessage`. Also, it'll simplify > the endpoint serving. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8124) PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388048#comment-16388048 ] Benjamin Bannier commented on MESOS-8124: - Another failure from parallel test execution: {noformat} 1 out of 1903 tests failed [ RUN ] PosixRLimitsIsolatorTest.TaskExceedingLimit ../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:342: Failure Expected: TASK_STARTING To be equal to: statusStarting->state() Which is: TASK_FAILED Ready ../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:344: Failure Failed to wait 15secs for statusRunning ../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:333: Failure Actual function call count doesn't match EXPECT_CALL(sched, statusUpdate(&driver, _))... Expected: to be called 3 times Actual: called once - unsatisfied and active [ FAILED ] PosixRLimitsIsolatorTest.TaskExceedingLimit (16986 ms) {noformat} It might make sense to use a limit that is not time-based to reduce this kind of flakiness. > PosixRLimitsIsolatorTest.TaskExceedingLimit is flaky. > - > > Key: MESOS-8124 > URL: https://issues.apache.org/jira/browse/MESOS-8124 > Project: Mesos > Issue Type: Bug > Components: test >Reporter: Benjamin Mahler >Priority: Major > Labels: flaky-test > Attachments: failed.txt, success.txt > > > This test is flaky on CI: > {noformat} > ../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:348: Failure > Failed to wait 15secs for statusFailed > ../../src/tests/containerizer/posix_rlimits_isolator_tests.cpp:333: Failure > Actual function call count doesn't match EXPECT_CALL(sched, > statusUpdate(&driver, _))... > Expected: to be called 3 times >Actual: called twice - unsatisfied and active > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8642) balloon-executor is hard to run as unprivileged user
Benjamin Bannier created MESOS-8642: --- Summary: balloon-executor is hard to run as unprivileged user Key: MESOS-8642 URL: https://issues.apache.org/jira/browse/MESOS-8642 Project: Mesos Issue Type: Bug Environment: The {{balloon-executor}} currently requires the ability to {{mlock}} large amounts of memory in order to prevent swapping. Since the amount of memory users can {{mlock}} is controlled by an rlimit, this can make it harder than needed to run this executor as an unprivileged user. It should at least be possible to drop the {{mlock}}'ing completely if the host system uses no swap. Reporter: Benjamin Bannier Assignee: Benjamin Bannier -- This message was sent by Atlassian JIRA (v7.6.3#76005)
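The rlimit mentioned above is `RLIMIT_MEMLOCK` on POSIX systems. As a rough sketch, an executor could consult it before attempting to lock memory. `canLock` is a hypothetical helper that does not appear in the balloon executor, and it only compares against the soft limit without accounting for memory that is already locked or for privileges that bypass the limit.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/resource.h>

// Returns true if locking `bytes` would fit under the soft
// RLIMIT_MEMLOCK limit of the current process. Hypothetical helper.
bool canLock(std::size_t bytes)
{
  struct rlimit limit;
  if (getrlimit(RLIMIT_MEMLOCK, &limit) != 0) {
    return false;  // Treat an unreadable limit as "cannot lock".
  }
  return limit.rlim_cur == RLIM_INFINITY || bytes <= limit.rlim_cur;
}
```

A caller could skip `mlock` when this returns false instead of failing outright, which is in the spirit of the issue's suggestion to make the locking optional.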
[jira] [Commented] (MESOS-8593) Support credential updates in Docker config without restarting the agent
[ https://issues.apache.org/jira/browse/MESOS-8593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387683#comment-16387683 ] Kshitiz Bakshi commented on MESOS-8593: --- Hi, We're users of DC/OS Community edition, and hence our interface for Mesos is via Marathon. DC/OS Community Edition does not support using different image pull secrets for each task. In the Docker containerizer case, everything already works by changing the creds file because Docker reads credentials on every image pull. We have our own tooling to refresh the credentials. Migrating to UCR is blocked for us because of this issue, as mesos-agent does not read the passed file on each image pull. > Support credential updates in Docker config without restarting the agent > > > Key: MESOS-8593 > URL: https://issues.apache.org/jira/browse/MESOS-8593 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Reporter: Jan Schlicht >Priority: Major > > When using the Mesos containerizer with a private Docker repository with > {{--docker_config}} option, the repository might expire credentials after > some time, forcing the user to login again. In that case the Docker config in > use will change and the agent needs to be restarted to reflect the change. > Instead of restarting, the agent could reload the Docker config file every > time before fetching. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8497) Docker parameter `name` does not work with Docker Containerizer.
[ https://issues.apache.org/jira/browse/MESOS-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387460#comment-16387460 ] Qian Zhang commented on MESOS-8497: --- RR: https://reviews.apache.org/r/65918/ > Docker parameter `name` does not work with Docker Containerizer. > > > Key: MESOS-8497 > URL: https://issues.apache.org/jira/browse/MESOS-8497 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Jörg Schad >Assignee: Qian Zhang >Priority: Critical > Labels: containerizer > Attachments: agent.log, master.log > > > When deploying a marathon app with Docker Containerizer (need to check Mesos > Containerizer) and the parameter name set, Mesos is not able to > recognize/control/kill the started container. > Steps to reproduce > # Deploy the below marathon app definition > # Watch task being stuck in staging and mesos not being able to kill > it/communicate with it > ## > {quote}e.g., Agent Logs: W0126 18:38:50.00 4988 slave.cpp:6750] Failed > to get resource statistics for executor > ‘instana-agent.1a1f8d22-02c8-11e8-b607-923c3c523109’ of framework > 41f1b534-5f9d-4b5e-bb74-a0e387d5739f-0001: Failed to run ‘docker -H > unix:///var/run/docker.sock inspect > mesos-1c6f894d-9a3e-408c-8146-47ebab2f28be’: exited with status 1; > stderr=’Error: No such image, container or task: > mesos-1c6f894d-9a3e-408c-8146-47ebab2f28be{quote} > # Check on node and see container running, but not being recognized by mesos > {noformat} > { > "id": "/docker-test", > "instances": 1, > "portDefinitions": [], > "container": { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "ubuntu:16.04", > "parameters": [ > { > "key": "name", > "value": "myname" > } > ] > } > }, > "cpus": 0.1, > "mem": 128, > "requirePorts": false, > "networks": [], > "healthChecks": [], > "fetch": [], > "constraints": [], > "cmd": "sleep 1000" > } > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)