[jira] [Created] (MESOS-6498) Broken links in authorization documentation
Vinod Kone created MESOS-6498: - Summary: Broken links in authorization documentation Key: MESOS-6498 URL: https://issues.apache.org/jira/browse/MESOS-6498 Project: Mesos Issue Type: Bug Components: documentation Reporter: Vinod Kone Looks like a bunch of links in the authorization doc need to be re-written. https://validator.w3.org/checklink?uri=http%3A%2F%2Fmesos.apache.org%2Fdocumentation%2Flatest%2Fauthorization%2F&hide_type=all&depth=&check=Check -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6489) Better support for containers that want to manage their own cgroup.
[ https://issues.apache.org/jira/browse/MESOS-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613942#comment-15613942 ] Yan Xu commented on MESOS-6489: --- A slightly different proposal for discussion:
h4. Part one
Give {{Future<Nothing> cgroups::destroy(const string& hierarchy, const string& cgroup)}} a {{bool continueOnError = true}} argument. In it:
1. Extract all cgroups (including sub-cgroups) in a bottom-up fashion via {{cgroups::get(hierarchy, cgroup)}}.
2. Hand things over to the {{Destroyer}} (i.e., don't remove cgroups in {{cgroups::destroy}} itself; let the {{Destroyer}} abstract this part out, since it can still do its job without a freezer subsystem).
3. The Destroyer first tries to kill tasks. If the freezer subsystem is available, it kills the tasks using TaskKillers; if not, this step is a no-op.
4. The TaskKiller for an individual cgroup will fail if the cgroup is missing. It's OK if we let it fail; we don't need to change the logic here.
5. The Destroyer is given a {{bool continueOnError}} mode: if {{true}}, it {{await}}s all futures (instead of using {{collect}}) and tries to remove cgroups recursively regardless of failed futures. This is safe because if there are running tasks in a cgroup, {{remove()}} would simply fail. We are also not singling out the "somebody has removed the cgroup in a race condition" case, but rather giving the Destroyer a more aggressive mode. If Docker removed a nested cgroup and caused an error, we still propagate the error after the more aggressive cleanup, which still ends up destroying everything. The caller can decide how to deal with the error.
h4. Part two
{{repair}} the Future returned by {{cgroups::destroy(...)}} in {{LinuxLauncherProcess::destroy(...)}}: With the above, the "hacky" part is left to {{Future<Nothing> LinuxLauncherProcess::destroy(const ContainerID& containerId)}}, where you can repair the destroy:
{code}
return cgroups::destroy(
    freezerHierarchy,
    cgroup(container->id),
    cgroups::DESTROY_TIMEOUT)
  .repair([](const Future<Nothing>& result) {
    // Comments explaining this.
    return cgroups::exists(cgroup(container->id)) ? result : Nothing();
  });
{code}
--- As a separate improvement, we can freeze the cgroups top-down by reversing the list returned by {{cgroups::get()}} when launching {{TaskKillers}}. {{TaskKillers}} currently run in parallel, but that should be OK.
> Better support for containers that want to manage their own cgroup. > --- > > Key: MESOS-6489 > URL: https://issues.apache.org/jira/browse/MESOS-6489 > Project: Mesos > Issue Type: Improvement >Reporter: Jie Yu > > Some containers want to manage their cgroup by sub-dividing the cgroup that > Mesos allocates to them into multiple sub-cgroups and putting subprocesses into the > corresponding sub-cgroups. > For instance, someone wants to run the Docker daemon in a Mesos container. The Docker > daemon will manage the cgroup assigned to it by Mesos (with the help of, for > example, the cgroup namespace). > Problems arise during the teardown of the container because two entities > might be manipulating the same cgroup simultaneously. For example, the Mesos > cgroups::destroy might fail if the task running inside is trying to delete > the same nested cgroup at the same time. > To support that case, we should consider killing all the processes in the Mesos > cgroup first, making sure that no one will be creating sub-cgroups and moving > new processes into sub-cgroups. And then, destroy the cgroups recursively.
> And we need the freezer because we want to make sure all processes are stopped > while we are sending kill signals, to avoid the TOCTTOU race problem. I think it > makes more sense to freeze the cgroups (and sub-cgroups) from the top down > (rather than bottom up) because, typically, processes in the parent cgroup > manipulate sub-cgroups. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
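To make the part-one semantics above concrete, here is a minimal, self-contained C++ sketch of the "continue on error" idea. It is only a sketch under stated assumptions, not the Mesos sources: the helpers {{killTasks()}} and {{removeCgroup()}} are hypothetical stubs, the cgroup list is assumed to already be ordered bottom-up, and the point is purely the {{collect}} vs. {{await}} distinction, i.e. await all kill futures, attempt removal regardless, and still propagate the first error afterwards.
{code}
#include <list>
#include <string>

#include <process/collect.hpp>  // process::await
#include <process/future.hpp>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/option.hpp>
#include <stout/try.hpp>

using process::Failure;
using process::Future;

// Hypothetical helpers, stubbed for illustration only; the real logic lives
// in the cgroups helpers and the Destroyer/TaskKiller actors.
Future<Nothing> killTasks(const std::string& cgroup) { return Nothing(); }
Try<Nothing> removeCgroup(const std::string& cgroup) { return Nothing(); }

// Destroy `cgroups` (assumed already ordered bottom-up). With
// `continueOnError`, we `await` *all* kill futures instead of `collect`ing
// them, so a single missing or already-removed cgroup does not abort the
// cleanup of its siblings.
Future<Nothing> destroyAll(
    const std::list<std::string>& cgroups,
    bool continueOnError = true)
{
  std::list<Future<Nothing>> kills;
  for (const std::string& cgroup : cgroups) {
    kills.push_back(killTasks(cgroup));
  }

  return process::await(kills)
    .then([=](const std::list<Future<Nothing>>& results) -> Future<Nothing> {
      Option<Error> error;

      // Remember the first kill failure, if any.
      for (const Future<Nothing>& result : results) {
        if (!result.isReady() && error.isNone()) {
          error = Error(result.isFailed() ? result.failure() : "discarded");
        }
      }

      if (error.isSome() && !continueOnError) {
        return Failure(error->message);
      }

      // Attempt removal regardless: a cgroup that still has running tasks
      // will simply fail to be removed and surface its own error.
      for (const std::string& cgroup : cgroups) {
        Try<Nothing> removed = removeCgroup(cgroup);
        if (removed.isError() && error.isNone()) {
          error = Error(removed.error());
        }
      }

      // Still propagate the first error after the aggressive cleanup.
      if (error.isSome()) {
        return Failure(error->message);
      }

      return Nothing();
    });
}
{code}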
[jira] [Commented] (MESOS-6040) Add a CMake build for `mesos-port-mapper`
[ https://issues.apache.org/jira/browse/MESOS-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613679#comment-15613679 ] Avinash Sridharan commented on MESOS-6040: -- Removing this from this sprint since I am traveling and won't have cycles to work on this in the next sprint. > Add a CMake build for `mesos-port-mapper` > - > > Key: MESOS-6040 > URL: https://issues.apache.org/jira/browse/MESOS-6040 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan >Priority: Blocker > Labels: mesosphere > > Once the port-mapper binary compiles with GNU make, we need to modify the > CMake to build the port-mapper binary as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6040) Add a CMake build for `mesos-port-mapper`
[ https://issues.apache.org/jira/browse/MESOS-6040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6040: - Sprint: Mesosphere Sprint 41, Mesosphere Sprint 42 (was: Mesosphere Sprint 41, Mesosphere Sprint 42, Mesosphere Sprint 45) > Add a CMake build for `mesos-port-mapper` > - > > Key: MESOS-6040 > URL: https://issues.apache.org/jira/browse/MESOS-6040 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Avinash Sridharan >Assignee: Avinash Sridharan >Priority: Blocker > Labels: mesosphere > > Once the port-mapper binary compiles with GNU make, we need to modify the > CMake to build the port-mapper binary as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6400) Not able to remove Orphan Tasks
[ https://issues.apache.org/jira/browse/MESOS-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613513#comment-15613513 ] Gilbert Song commented on MESOS-6400: - [~mithril], thanks for recording the logs. We will address all of the related tech debt in Mesos. BTW, you can resolve the orphan task issue by tearing down the unregistered Marathon framework using the workaround in the following doc: https://gist.github.com/bernadinm/41bca6058f9137cd21f4fb562fd20d50 > Not able to remove Orphan Tasks > --- > > Key: MESOS-6400 > URL: https://issues.apache.org/jira/browse/MESOS-6400 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 > Environment: centos 7 x64 >Reporter: kasim >Priority: Critical > > The problem may be caused by Mesos and Marathon being out of sync: > https://github.com/mesosphere/marathon/issues/616 > When I noticed the orphan tasks: > 1. I restarted Marathon. > 2. Marathon did not sync the orphan tasks, but started new tasks. > 3. The orphan tasks still held their resources, so I had to delete them. > 4. I found all orphan tasks are under the framework > `ef169d8a-24fc-41d1-8b0d-c67718937a48-`; > curl -XGET `http://c196:5050/master/frameworks` shows that the framework is listed under > `unregistered_frameworks`: > {code} > { > "frameworks": [ > . > ], > "completed_frameworks": [ ], > "unregistered_frameworks": [ > "ef169d8a-24fc-41d1-8b0d-c67718937a48-", > "ef169d8a-24fc-41d1-8b0d-c67718937a48-", > "ef169d8a-24fc-41d1-8b0d-c67718937a48-" > ] > } > {code} > 5. Trying {code}curl -XPOST http://c196:5050/master/teardown -d > 'frameworkId=ef169d8a-24fc-41d1-8b0d-c67718937a48-' {code} > returns `No framework found with specified ID`. > So I have no idea how to delete the orphan tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6497) HTTP Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613287#comment-15613287 ] Anand Mazumdar edited comment on MESOS-6497 at 10/27/16 9:30 PM: - We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} event thereby providing the schedulers with this information. Another option was adding it to the {{connected}} callback on the scheduler library but we punted on it because in the future schedulers might want to use their own detection library that might not read contents from Master ZK to populate {{MasterInfo}} correctly. was (Author: anandmazumdar): We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} event thereby providing the schedulers with this information. Another option was adding it to the {{connected}} callback on the scheduler library but we punted on it because in the future schedulers might want to use their own detection library that might not read contents from Master ZK. > HTTP Adapter does not surface MasterInfo. > - > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6497) HTTP Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613287#comment-15613287 ] Anand Mazumdar commented on MESOS-6497: --- We decided to have an optional {{MasterInfo}} field in the {{SUBSCRIBED}} event thereby providing the schedulers with this information. Another option was adding it to the {{connected}} callback on the scheduler library but we punted on it because in the future schedulers might want to use their own detection library that might not read contents from Master ZK. > HTTP Adapter does not surface MasterInfo. > - > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
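For illustration, here is roughly what consuming that optional field could look like from a v1 scheduler's event handler. This is a sketch only: the {{master_info}} accessor name on {{Event::Subscribed}} is an assumption (the comment above only says an optional {{MasterInfo}} field will be added), and the handler is a free function rather than the scheduler library's actual callback plumbing.
{code}
#include <iostream>

#include <mesos/v1/mesos.hpp>
#include <mesos/v1/scheduler/scheduler.hpp>

using mesos::v1::MasterInfo;
using mesos::v1::scheduler::Event;

// Sketch of handling SUBSCRIBED once it carries MasterInfo. The `master_info`
// accessor name is assumed here; check the final protobuf definition.
void handle(const Event& event)
{
  if (event.type() == Event::SUBSCRIBED) {
    const Event::Subscribed& subscribed = event.subscribed();

    std::cout << "Subscribed with framework ID "
              << subscribed.framework_id().value() << std::endl;

    // The field is optional, so guard against masters that do not set it.
    if (subscribed.has_master_info()) {
      const MasterInfo& info = subscribed.master_info();
      std::cout << "Leading master at " << info.hostname() << ":"
                << info.port() << std::endl;
    }
  }
}
{code}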
[jira] [Updated] (MESOS-6497) HTTP Adapter does not surface MasterInfo.
[ https://issues.apache.org/jira/browse/MESOS-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6497: -- Shepherd: Vinod Kone Description: The HTTP adapter does not surface the {{MasterInfo}}. This makes it not compatible with the V0 API where the {{registered}} and {{reregistered}} calls provided the MasterInfo to the framework. cc [~vinodkone] was: The HTTP adapter does not surface the MasterInfo. This makes it not compatible with the V0 API where the {{registered}} and {{reregistered}} calls provided the MasterInfo to the framework. cc [~vinodkone] Summary: HTTP Adapter does not surface MasterInfo. (was: HTTP Adapter does not surface MasterInfo) > HTTP Adapter does not surface MasterInfo. > - > > Key: MESOS-6497 > URL: https://issues.apache.org/jira/browse/MESOS-6497 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: Joris Van Remoortere >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere, v1_api > > The HTTP adapter does not surface the {{MasterInfo}}. This makes it not > compatible with the V0 API where the {{registered}} and {{reregistered}} > calls provided the MasterInfo to the framework. > cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6462) Design Doc: Mesos Support for Container Attach and Container Exec
[ https://issues.apache.org/jira/browse/MESOS-6462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-6462: --- Description: Here is a link to the design doc: https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU It is not yet complete, but it is filled out enough to start eliciting feedback. Please feel free to add comments (or even add content!) as you wish. was: Here is a link to the design doc: https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU/edit#heading=h.jcjim99nrfbv It is not yet complete, but it is filled out enough to start eliciting feedback. Please feel free to add comments (or even add content!) as you wish. > Design Doc: Mesos Support for Container Attach and Container Exec > - > > Key: MESOS-6462 > URL: https://issues.apache.org/jira/browse/MESOS-6462 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: debugging, mesosphere > > Here is a link to the design doc: > https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU > It is not yet complete, but it is filled out enough to start eliciting > feedback. Please feel free to add comments (or even add content!) as you wish. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6462) Design Doc: Mesos Support for Container Attach and Container Exec
[ https://issues.apache.org/jira/browse/MESOS-6462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Klues updated MESOS-6462: --- Description: Here is a link to the design doc: https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU/edit#heading=h.jcjim99nrfbv It is not yet complete, but it is filled out enough to start eliciting feedback. Please feel free to add comments (or even add content!) as you wish. > Design Doc: Mesos Support for Container Attach and Container Exec > - > > Key: MESOS-6462 > URL: https://issues.apache.org/jira/browse/MESOS-6462 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: debugging, mesosphere > > Here is a link to the design doc: > https://docs.google.com/document/d/1nAVr0sSSpbDLrgUlAEB5hKzCl482NSVk8V0D56sFMzU/edit#heading=h.jcjim99nrfbv > It is not yet complete, but it is filled out enough to start eliciting > feedback. Please feel free to add comments (or even add content!) as you wish. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6446: -- Fix Version/s: 1.0.2 Cherry-picked for 1.0.2 commit ec315f28e6f86813af3b756be190e1b48c21404d Author: Vinod Kone Date: Thu Oct 27 10:28:29 2016 -0700 Added MESOS-6446 to CHANGELOG for 1.0.2. commit 1ca4db714fd0acc6095a7a0e14c373a3775df528 Author: haosdent huang Date: Thu Oct 27 10:22:18 2016 -0700 Fixed the broken metrics information of master in WebUI. After we introduced redirection on `/master/state` endpoint to the leading master in `c9153336`, the metrics information in the WebUI was broken when the current master is not the leading master. In this patch, we retrieve the leading master from `/master/state` endpoint and ensure that requests to `/metrics/snapshot` and `/state` endpoints are always sent to the leading master. Review: https://reviews.apache.org/r/53172/ commit ef90134ccbcd3239241a6d5571aaaf0192e1c294 Author: haosdent huang Date: Thu Oct 27 10:22:09 2016 -0700 Show the leading master's information in `/master/state` endpoint. Review: https://reviews.apache.org/r/53193/ > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.0.2, 1.2.0 > > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6446: -- Fix Version/s: 1.1.0 Cherry-picked for 1.1.0 commit e69f819fc996f4c328a8968131a1e807a0692bf1 Author: Vinod Kone Date: Thu Oct 27 13:29:33 2016 -0700 Added MESOS-6446 to 1.1.0 CHANGELOG. commit 4ac4916a39a6beb81fab0c8d7d72fc6c06e2e650 Author: haosdent huang Date: Thu Oct 27 10:22:18 2016 -0700 Fixed the broken metrics information of master in WebUI. After we introduced redirection on `/master/state` endpoint to the leading master in `c9153336`, the metrics information in the WebUI was broken when the current master is not the leading master. In this patch, we retrieve the leading master from `/master/state` endpoint and ensure that requests to `/metrics/snapshot` and `/state` endpoints are always sent to the leading master. Review: https://reviews.apache.org/r/53172/ commit 0d747295cbcb897f245ef209a7760f0fad558a35 Author: haosdent huang Date: Thu Oct 27 10:22:09 2016 -0700 Show the leading master's information in `/master/state` endpoint. Review: https://reviews.apache.org/r/53193/ > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.0.2, 1.1.0, 1.2.0 > > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6497) HTTP Adapter does not surface MasterInfo
Joris Van Remoortere created MESOS-6497: --- Summary: HTTP Adapter does not surface MasterInfo Key: MESOS-6497 URL: https://issues.apache.org/jira/browse/MESOS-6497 Project: Mesos Issue Type: Bug Affects Versions: 1.1.0 Reporter: Joris Van Remoortere Assignee: Anand Mazumdar Priority: Blocker The HTTP adapter does not surface the MasterInfo. This makes it not compatible with the V0 API where the {{registered}} and {{reregistered}} calls provided the MasterInfo to the framework. cc [~vinodkone] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6372) Improvements to shared resources
[ https://issues.apache.org/jira/browse/MESOS-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anindya Sinha reassigned MESOS-6372: Assignee: Anindya Sinha > Improvements to shared resources > > > Key: MESOS-6372 > URL: https://issues.apache.org/jira/browse/MESOS-6372 > Project: Mesos > Issue Type: Epic >Reporter: Yan Xu >Assignee: Anindya Sinha > > This is a follow up epic to MESOS-3421 to capture further improvements and > changes that need to be made to the MVP. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5792) Add mesos tests to CMake (make check)
[ https://issues.apache.org/jira/browse/MESOS-5792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-5792: - Sprint: Mesosphere Sprint 40, Mesosphere Sprint 41, Mesosphere Sprint 42, Mesosphere Sprint 44 (was: Mesosphere Sprint 40, Mesosphere Sprint 41, Mesosphere Sprint 42, Mesosphere Sprint 44, Mesosphere Sprint 45) > Add mesos tests to CMake (make check) > - > > Key: MESOS-5792 > URL: https://issues.apache.org/jira/browse/MESOS-5792 > Project: Mesos > Issue Type: Improvement > Components: build >Reporter: Srinivas >Assignee: Srinivas > Labels: build, mesosphere > Original Estimate: 168h > Remaining Estimate: 168h > > Provide CMakeLists.txt and configuration files to build mesos tests using > CMake. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6496) Support up-casting of Shared and Owned
Neil Conway created MESOS-6496: -- Summary: Support up-casting of Shared and Owned Key: MESOS-6496 URL: https://issues.apache.org/jira/browse/MESOS-6496 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Neil Conway It should be possible to pass a {{Shared<Derived>}} value to an object that takes a parameter of type {{Shared<Base>}}. Similarly for {{Owned}}. In general, {{Shared<T2>}} should be implicitly convertible to {{Shared<T1>}} iff {{T2}} is implicitly convertible to {{T1}}. In C++11, the standard smart pointers support this because they define the appropriate conversion constructors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
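To make the mechanism concrete, below is a minimal standalone sketch of the kind of converting constructor that enables such up-casts, modeled on {{std::shared_ptr}}. The {{Shared}} class here is a stand-in written for illustration; it is not the actual libprocess type.
{code}
#include <memory>
#include <type_traits>

// Stand-in for a shared smart pointer, used only to illustrate the
// converting constructor that allows Shared<Derived> -> Shared<Base>.
template <typename T>
class Shared
{
public:
  explicit Shared(T* t) : data(t) {}

  // Converting constructor: participates in overload resolution only when
  // U* is implicitly convertible to T*, i.e. the "iff" condition above.
  template <typename U,
            typename = typename std::enable_if<
                std::is_convertible<U*, T*>::value>::type>
  Shared(const Shared<U>& that) : data(that.data) {}

  const T* operator->() const { return data.get(); }

private:
  template <typename U> friend class Shared;

  std::shared_ptr<const T> data;
};


struct Base { virtual ~Base() = default; };
struct Derived : Base {};

// With the converting constructor in place this compiles:
void takesBase(const Shared<Base>&) {}

int main()
{
  Shared<Derived> derived(new Derived());
  takesBase(derived);  // Implicit up-cast, as MESOS-6496 requests.
  return 0;
}
{code}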
[jira] [Updated] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-6446: Shepherd: Vinod Kone > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.2.0 > > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612536#comment-15612536 ] Anand Mazumdar commented on MESOS-6212: --- Keeping the JIRA open till I complete the backport to 1.0.2. > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > Fix For: 1.1.0 > > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6212: -- Target Version/s: 1.0.2, 1.1.0 (was: 1.0.2) Fix Version/s: (was: 1.0.2) 1.1.0 > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > Fix For: 1.1.0 > > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6458) Add test to check fromString function of stout library
[ https://issues.apache.org/jira/browse/MESOS-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6458: -- Target Version/s: (was: 1.0.2) Fix Version/s: 1.1.0 > Add test to check fromString function of stout library > -- > > Key: MESOS-6458 > URL: https://issues.apache.org/jira/browse/MESOS-6458 > Project: Mesos > Issue Type: Improvement > Components: stout >Affects Versions: 1.0.1 >Reporter: Manuwela Kanade >Assignee: Manuwela Kanade >Priority: Trivial > Fix For: 1.1.0 > > > For the 3rdparty stout library, there is a testcase for checking Malformed > UUID. > But this testcase does not have a positive test for the fromString function > to test if it returns correct UUID when passed a correctly formatted UUID > string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6495) Create metrics for HTTP API endpoint response codes.
[ https://issues.apache.org/jira/browse/MESOS-6495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-6495: - Summary: Create metrics for HTTP API endpoint response codes. (was: Create metrics for HTTP API endpoint) > Create metrics for HTTP API endpoint response codes. > > > Key: MESOS-6495 > URL: https://issues.apache.org/jira/browse/MESOS-6495 > Project: Mesos > Issue Type: Improvement >Reporter: Zhitao Li > > We should have some metrics about various response code for (scheduler) HTTP > API (2xx, 4xx, etc) > [~anandmazumdar] suggested that ideally the solution could be easily extended > to cover other endpoints if we can directly enhance libprocess, so we can > cover other API (Master/Agent). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6495) Create metrics for HTTP API endpoint
Zhitao Li created MESOS-6495: Summary: Create metrics for HTTP API endpoint Key: MESOS-6495 URL: https://issues.apache.org/jira/browse/MESOS-6495 Project: Mesos Issue Type: Improvement Reporter: Zhitao Li We should have some metrics about the various response codes for the (scheduler) HTTP API (2xx, 4xx, etc.). [~anandmazumdar] suggested that ideally the solution could be easily extended to cover other endpoints if we can directly enhance libprocess, so that we can also cover the other APIs (Master/Agent). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
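As a rough sketch of what this could look like with the libprocess metrics primitives ({{process::metrics::Counter}} plus {{add}}/{{remove}}): the {{ApiMetrics}} struct and the metric names below are made up for illustration, and only the idea of bucketing responses by status class is the point. Making such counters a generic part of libprocess request routing, as suggested above, would avoid wiring this up by hand for every endpoint.
{code}
#include <process/metrics/counter.hpp>
#include <process/metrics/metrics.hpp>

using process::metrics::Counter;

// Hypothetical per-endpoint response-class counters; names are illustrative.
struct ApiMetrics
{
  ApiMetrics()
    : responses_2xx("master/http/api/v1/scheduler/responses_2xx"),
      responses_4xx("master/http/api/v1/scheduler/responses_4xx"),
      responses_5xx("master/http/api/v1/scheduler/responses_5xx")
  {
    process::metrics::add(responses_2xx);
    process::metrics::add(responses_4xx);
    process::metrics::add(responses_5xx);
  }

  ~ApiMetrics()
  {
    process::metrics::remove(responses_2xx);
    process::metrics::remove(responses_4xx);
    process::metrics::remove(responses_5xx);
  }

  // Bump the counter matching the HTTP status code of a response.
  void count(unsigned int statusCode)
  {
    if (statusCode >= 200 && statusCode < 300) {
      ++responses_2xx;
    } else if (statusCode >= 400 && statusCode < 500) {
      ++responses_4xx;
    } else if (statusCode >= 500) {
      ++responses_5xx;
    }
  }

  Counter responses_2xx;
  Counter responses_4xx;
  Counter responses_5xx;
};
{code}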
[jira] [Commented] (MESOS-6494) Clean up the flags parsing in the executors
[ https://issues.apache.org/jira/browse/MESOS-6494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612390#comment-15612390 ] Gastón Kleiman commented on MESOS-6494: --- Patches in the chain starting with: https://reviews.apache.org/r/52878/ > Clean up the flags parsing in the executors > --- > > Key: MESOS-6494 > URL: https://issues.apache.org/jira/browse/MESOS-6494 > Project: Mesos > Issue Type: Improvement >Reporter: Gastón Kleiman >Assignee: Gastón Kleiman > > The current executors and the executor libraries use a mix of `stout::flags` > and `os::getenv` to parse flags, leading to a lot of unnecessary and > sometimes duplicated code. > This should be cleaned up, using only {{stout::flags}} to parse flags. > Environment variables should be used for the flags that are common to ALL the > executors (listed in the Executor HTTP API doc). > Command line parameters should be used for flags that apply only to > individual executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6494) Clean up the flags parsing in the executors
Gastón Kleiman created MESOS-6494: - Summary: Clean up the flags parsing in the executors Key: MESOS-6494 URL: https://issues.apache.org/jira/browse/MESOS-6494 Project: Mesos Issue Type: Improvement Reporter: Gastón Kleiman Assignee: Gastón Kleiman The current executors and the executor libraries use a mix of `stout::flags` and `os::getenv` to parse flags, leading to a lot of unnecessary and sometimes duplicated code. This should be cleaned up, using only {{stout::flags}} to parse flags. Environment variables should be used for the flags that are common to ALL the executors (listed in the Executor HTTP API doc). Command line parameters should be used for flags that apply only to individual executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
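For reference, a small sketch of the {{stout::flags}} style this ticket argues for, with made-up executor flags. The prefix-based {{load()}} call that also reads {{MESOS_EXECUTOR_*}} environment variables is written from memory, so treat the exact signature and return type as assumptions rather than the executors' real code.
{code}
#include <string>

#include <stout/flags.hpp>
#include <stout/option.hpp>
#include <stout/try.hpp>

// Illustrative executor flags; the flag names below are examples only.
class ExecutorFlags : public virtual flags::FlagsBase
{
public:
  ExecutorFlags()
  {
    add(&ExecutorFlags::launcher_dir,
        "launcher_dir",
        "Directory holding the executor helper binaries.");

    add(&ExecutorFlags::shutdown_grace_period,
        "shutdown_grace_period",
        "How long to wait before force-killing tasks on shutdown.",
        "5secs");
  }

  Option<std::string> launcher_dir;
  std::string shutdown_grace_period;
};


int main(int argc, char** argv)
{
  ExecutorFlags flags;

  // Parse command line flags and `MESOS_EXECUTOR_`-prefixed environment
  // variables in one place instead of scattering os::getenv() calls around.
  Try<flags::Warnings> load = flags.load("MESOS_EXECUTOR_", argc, argv);

  if (load.isError()) {
    return 1;
  }

  return 0;
}
{code}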
[jira] [Updated] (MESOS-6212) Validate the name format of mesos-managed docker containers
[ https://issues.apache.org/jira/browse/MESOS-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anand Mazumdar updated MESOS-6212: -- Shepherd: Timothy Chen (was: Anand Mazumdar) > Validate the name format of mesos-managed docker containers > --- > > Key: MESOS-6212 > URL: https://issues.apache.org/jira/browse/MESOS-6212 > Project: Mesos > Issue Type: Improvement > Components: containerization >Affects Versions: 1.0.1 >Reporter: Marc Villacorta >Assignee: Manuwela Kanade > Fix For: 1.0.2 > > > Validate the name format of mesos-managed docker containers in order to avoid > false positives when looking for orphaned mesos tasks. > Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ > are wrongly terminated when {{--docker_kill_orphans}} is set to true > (default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6493) Add test cases for the HTTPS health checks.
haosdent created MESOS-6493: --- Summary: Add test cases for the HTTPS health checks. Key: MESOS-6493 URL: https://issues.apache.org/jira/browse/MESOS-6493 Project: Mesos Issue Type: Task Components: tests Reporter: haosdent Assignee: haosdent Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6492) Deprecate the existing `SSL_` env variables
[ https://issues.apache.org/jira/browse/MESOS-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-6492: -- Target Version/s: 1.2.0 > Deprecate the existing `SSL_` env variables > --- > > Key: MESOS-6492 > URL: https://issues.apache.org/jira/browse/MESOS-6492 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Gastón Kleiman > > `SSL_` env variables are deprecated by `LIBPROCES_SSL_`. > Cleanup the code once the deprecation cycle is over. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6327) Large docker images causes container launch failures: Too many levels of symbolic links
[ https://issues.apache.org/jira/browse/MESOS-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612156#comment-15612156 ] Rogier Dikkes commented on MESOS-6327: -- More information: Last week I created a Docker image with 21 layers, based on ubuntu:16.04 and containing a few packages. Today I updated the image to remove a typo in it, and the image grew by 30 MB in size (not in layers). Now I'm running into the issue described above. imagename 0.2.7 be78f88bb969 37 minutes ago 418.3 MB imagename 0.2.6 2022190ada2c 7 days ago 391.9 MB Some years ago the LXC community ran into this too; back then it was autofs causing the issues. I have ensured that autofs and automount were not running on the hosts. > Large docker images causes container launch failures: Too many levels of > symbolic links > --- > > Key: MESOS-6327 > URL: https://issues.apache.org/jira/browse/MESOS-6327 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.0, 1.0.1 > Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in > the Apache Aurora vagrant image >Reporter: Rogier Dikkes >Priority: Critical > > When deploying Mesos containers with large (6G+, 60+ layers) Docker images, > the task crashes with the error: > Mesos agent logs: > E1007 08:40:12.954227 8117 slave.cpp:3976] Container > 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor > 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365' > of framework dfc91a86-84b9-4539-a7be-4ace7b7b44a1- failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot stat > ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/backends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’: > Too many levels of symbolic links > ... (complete pastebin: http://pastebin.com/umZ4Q5d1 ) > How to replicate: > Start the Aurora vagrant image. Adjust > /etc/mesos-slave/executor_registration_timeout to 5 mins. Adjust the file > /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker > image instead of the example. (You can use anldisr/jupyter:0.4, which I created as a > test image; it is based upon the Jupyter notebook stacks.) Create the job and > watch it fail after x number of minutes. > The Mesos sandbox is empty. > Aurora errors I see: > 28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect > failed: Failed to copy layer: cp: cannot stat > ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’: > Too many levels of symbolic links cp: cannot stat ... > Too many levels of symbolic links ; Container destroyed while provisioning > images > (complete pastebin: http://pastebin.com/uecHYD5J ) > To rule out the image, I started this and more images as normal Docker > containers. This works without issues. > Related Mesos flags configured: > -appc_store_dir > /tmp/mesos/images/appc > -containerizers > docker,mesos > -executor_registration_timeout > 5mins > -image_providers > appc,docker > -image_provisioner_backend > copy > -isolation > filesystem/linux,docker/runtime > Affected Mesos versions tested: 1.0.1 & 1.0.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6492) Deprecate the existing `SSL_` env variables
[ https://issues.apache.org/jira/browse/MESOS-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-6492: -- Description: {{SSL_}} env variables are deprecated by {{LIBPROCES_SSL_}}. Cleanup the code once the deprecation cycle is over. was: `SSL_` env variables are deprecated by `LIBPROCES_SSL_`. Cleanup the code once the deprecation cycle is over. > Deprecate the existing `SSL_` env variables > --- > > Key: MESOS-6492 > URL: https://issues.apache.org/jira/browse/MESOS-6492 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Gastón Kleiman > > {{SSL_}} env variables are deprecated by {{LIBPROCES_SSL_}}. > Cleanup the code once the deprecation cycle is over. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6327) Large docker images causes container launch failures: Too many levels of symbolic links
[ https://issues.apache.org/jira/browse/MESOS-6327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612156#comment-15612156 ] Rogier Dikkes edited comment on MESOS-6327 at 10/27/16 3:08 PM: More information: Last week i created an docker image containing 21 layers which is based on ubuntu:16.04 containing a few packages, today i updated the image to remove a typo in it and the image increased 30MB in size (not layers) i suspect because of package updates. Now im running into the issue as above. imagename 0.2.7 be78f88bb96937 minutes ago 418.3 MB imagename 0.2.6 2022190ada2c7 days ago 391.9 MB Some years ago the lxc community ran into this too, back then it was autofs causing issues. I have ensured autofs and automount were not running on the hosts. was (Author: a-nldisr): More information: Last week i created an docker image containing 21 layers which is based on ubuntu:16.04 containing a few packages, today i updated the image to remove a typo in it and the image increased 30MB in size (not layers). Now im running into the issue as above. imagename 0.2.7 be78f88bb96937 minutes ago 418.3 MB imagename 0.2.6 2022190ada2c7 days ago 391.9 MB Some years ago the lxc community ran into this too, back then it was autofs causing issues. I have ensured autofs and automount were not running on the hosts. > Large docker images causes container launch failures: Too many levels of > symbolic links > --- > > Key: MESOS-6327 > URL: https://issues.apache.org/jira/browse/MESOS-6327 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.0, 1.0.1 > Environment: centos 7.2 (1511), ubuntu 14.04 (trusty). Replicated in > the Apache Aurora vagrant image >Reporter: Rogier Dikkes >Priority: Critical > > When deploying Mesos containers with large (6G+, 60+ layers) Docker images > the task crashes with the error: > Mesos agent logs: > E1007 08:40:12.954227 8117 slave.cpp:3976] Container > 'a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4' for executor > 'thermos-www-data-devel-hello_docker_image-0-d42d2af6-6b44-4b2b-be95-e1ba93a6b365' > of framework df > c91a86-84b9-4539-a7be-4ace7b7b44a1- failed to start: Collect failed: > Collect failed: Failed to copy layer: cp: cannot stat > ‘/var/lib/mesos/provisioner/containers/a1d759ae-5bc6-4c4e-ac03-717fbb8e5da4/b > ackends/copy/rootfses/5f328f72-25d4-4a26-ac83-8d30bbc44e97/usr/share/zoneinfo/right/Asia/Urumqi’: > Too many levels of symbolic links > ... (complete pastebin: http://pastebin.com/umZ4Q5d1 ) > How to replicate: > Start the aurora vagrant image. Adjust the > /etc/mesos-slave/executor_registration_timeout to 5 mins. Adjust the file > /vagrant/examples/jobs/hello_docker_image.aurora to start a large Docker > image instead of the example. (you can use anldisr/jupyter:0.4 i created as a > test image, this is based upon the jupyter notebook stacks.). Create the job, > watch it fail after x number of minutes. > The mesos sandbox is empty. > Aurora errors i see: > 28 minutes ago - FAILED : Failed to launch container: Collect failed: Collect > failed: Failed to copy layer: cp: cannot stat > ‘/var/lib/mesos/provisioner/containers/93420a36-0e0c-4f04-b401-74c426c25686/backends/copy/rootfses/6e185a51-7174-4b0d-a305-42b634eb91bb/usr/share/zoneinfo/right/Asia/Urumqi’: > Too many levels of symbolic links cp: cannot stat ... 
> Too many levels of symbolic links ; Container destroyed while provisioning > images > (complete pastebin: http://pastebin.com/uecHYD5J ) > To rule out the image i started this and more images as a normal Docker > container. This works without issues. > Mesos flags related configured: > -appc_store_dir > /tmp/mesos/images/appc > -containerizers > docker,mesos > -executor_registration_timeout > 5mins > -image_providers > appc,docker > -image_provisioner_backend > copy > -isolation > filesystem/linux,docker/runtime > Affected Mesos versions tested: 1.0.1 & 1.0.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6492) Deprecate the existing `SSL_` env variables
[ https://issues.apache.org/jira/browse/MESOS-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gastón Kleiman updated MESOS-6492: -- Description: {{SSL_}} env variables are deprecated by {{LIBPROCESS_SSL_}}. Cleanup the code once the deprecation cycle is over. was: {{SSL_}} env variables are deprecated by {{LIBPROCES_SSL_}}. Cleanup the code once the deprecation cycle is over. > Deprecate the existing `SSL_` env variables > --- > > Key: MESOS-6492 > URL: https://issues.apache.org/jira/browse/MESOS-6492 > Project: Mesos > Issue Type: Task > Components: libprocess >Reporter: Gastón Kleiman > > {{SSL_}} env variables are deprecated by {{LIBPROCESS_SSL_}}. > Cleanup the code once the deprecation cycle is over. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6458) Add test to check fromString function of stout library
[ https://issues.apache.org/jira/browse/MESOS-6458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated MESOS-6458: Shepherd: Timothy Chen > Add test to check fromString function of stout library > -- > > Key: MESOS-6458 > URL: https://issues.apache.org/jira/browse/MESOS-6458 > Project: Mesos > Issue Type: Improvement > Components: stout >Affects Versions: 1.0.1 >Reporter: Manuwela Kanade >Assignee: Manuwela Kanade >Priority: Trivial > > For the 3rdparty stout library, there is a testcase for checking Malformed > UUID. > But this testcase does not have a positive test for the fromString function > to test if it returns correct UUID when passed a correctly formatted UUID > string. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6492) Deprecate the existing `SSL_` env variables
Gastón Kleiman created MESOS-6492: - Summary: Deprecate the existing `SSL_` env variables Key: MESOS-6492 URL: https://issues.apache.org/jira/browse/MESOS-6492 Project: Mesos Issue Type: Task Components: libprocess Reporter: Gastón Kleiman `SSL_` env variables are deprecated by `LIBPROCES_SSL_`. Cleanup the code once the deprecation cycle is over. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6293) HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros.
[ https://issues.apache.org/jira/browse/MESOS-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612021#comment-15612021 ] haosdent commented on MESOS-6293: - As [~alexr] investigation result, the test case would fail when we set {{LIBPROCESS_IP}} in the environment because running test cases because master would bind to {{LIBPROCESS_IP}} and didn't listen on {{127.0.0.1}}. > HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros. > > > Key: MESOS-6293 > URL: https://issues.apache.org/jira/browse/MESOS-6293 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: health-check, mesosphere > > I see consistent failures of this test in the internal CI in *some* distros, > specifically CentOS 6, Ubuntu 14, 15, 16. The source of the health check > failure is always the same: {{curl}} cannot connect to the target: > {noformat} > Received task health update, healthy: false > W0929 17:22:05.270992 2730 health_checker.cpp:204] Health check failed 1 > times consecutively: HTTP health check failed: curl returned exited with > status 7: curl: (7) couldn't connect to host > I0929 17:22:05.273634 26850 slave.cpp:3609] Handling status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from executor(1)@172.30.2.20:58660 > I0929 17:22:05.274178 26844 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274226 26844 status_update_manager.cpp:377] Forwarding update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to the agent > I0929 17:22:05.274314 26845 slave.cpp:4026] Forwarding the update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to master@172.30.2.20:38955 > I0929 17:22:05.274415 26845 slave.cpp:3920] Status update manager > successfully handled status update TASK_RUNNING (UUID: > f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274436 26845 slave.cpp:3936] Sending acknowledgement for > status update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for > task aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of > framework 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to > executor(1)@172.30.2.20:58660 > I0929 17:22:05.274534 26849 master.cpp:5661] Status update TASK_RUNNING > (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from agent > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-S0 at slave(77)@172.30.2.20:38955 > (ip-172-30-2-20.mesosphere.io) > ../../src/tests/health_check_tests.cpp:1398: Failure > I0929 17:22:05.274567 26849 master.cpp:5723] Forwarding status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > Value of: statusHealth.get().healthy() > Actual: false > Expected: true > I0929 17:22:05.274636 26849 master.cpp:7560] Updating the state of task > aa0792d3-8d85-4c32-bd04-56a9b552ebda of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- (latest state: TASK_RUNNING, status > update state: TASK_RUNNING) > I0929 17:22:05.274829 26844 sched.cpp:1025] Scheduler::statusUpdate took > 43297ns > Received SHUTDOWN event > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6293) HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros.
[ https://issues.apache.org/jira/browse/MESOS-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-6293: Assignee: Alexander Rukletsov (was: haosdent) > HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros. > > > Key: MESOS-6293 > URL: https://issues.apache.org/jira/browse/MESOS-6293 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: health-check, mesosphere > > I see consistent failures of this test in the internal CI in *some* distros, > specifically CentOS 6, Ubuntu 14, 15, 16. The source of the health check > failure is always the same: {{curl}} cannot connect to the target: > {noformat} > Received task health update, healthy: false > W0929 17:22:05.270992 2730 health_checker.cpp:204] Health check failed 1 > times consecutively: HTTP health check failed: curl returned exited with > status 7: curl: (7) couldn't connect to host > I0929 17:22:05.273634 26850 slave.cpp:3609] Handling status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from executor(1)@172.30.2.20:58660 > I0929 17:22:05.274178 26844 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274226 26844 status_update_manager.cpp:377] Forwarding update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to the agent > I0929 17:22:05.274314 26845 slave.cpp:4026] Forwarding the update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to master@172.30.2.20:38955 > I0929 17:22:05.274415 26845 slave.cpp:3920] Status update manager > successfully handled status update TASK_RUNNING (UUID: > f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274436 26845 slave.cpp:3936] Sending acknowledgement for > status update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for > task aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of > framework 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to > executor(1)@172.30.2.20:58660 > I0929 17:22:05.274534 26849 master.cpp:5661] Status update TASK_RUNNING > (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from agent > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-S0 at slave(77)@172.30.2.20:38955 > (ip-172-30-2-20.mesosphere.io) > ../../src/tests/health_check_tests.cpp:1398: Failure > I0929 17:22:05.274567 26849 master.cpp:5723] Forwarding status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > Value of: statusHealth.get().healthy() > Actual: false > Expected: true > I0929 17:22:05.274636 26849 master.cpp:7560] Updating the state of task > aa0792d3-8d85-4c32-bd04-56a9b552ebda of 
framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- (latest state: TASK_RUNNING, status > update state: TASK_RUNNING) > I0929 17:22:05.274829 26844 sched.cpp:1025] Scheduler::statusUpdate took > 43297ns > Received SHUTDOWN event > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6293) HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros.
[ https://issues.apache.org/jira/browse/MESOS-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-6293: Shepherd: (was: Alexander Rukletsov) > HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros. > > > Key: MESOS-6293 > URL: https://issues.apache.org/jira/browse/MESOS-6293 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: health-check, mesosphere > > I see consistent failures of this test in the internal CI in *some* distros, > specifically CentOS 6, Ubuntu 14, 15, 16. The source of the health check > failure is always the same: {{curl}} cannot connect to the target: > {noformat} > Received task health update, healthy: false > W0929 17:22:05.270992 2730 health_checker.cpp:204] Health check failed 1 > times consecutively: HTTP health check failed: curl returned exited with > status 7: curl: (7) couldn't connect to host > I0929 17:22:05.273634 26850 slave.cpp:3609] Handling status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from executor(1)@172.30.2.20:58660 > I0929 17:22:05.274178 26844 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274226 26844 status_update_manager.cpp:377] Forwarding update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to the agent > I0929 17:22:05.274314 26845 slave.cpp:4026] Forwarding the update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to master@172.30.2.20:38955 > I0929 17:22:05.274415 26845 slave.cpp:3920] Status update manager > successfully handled status update TASK_RUNNING (UUID: > f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274436 26845 slave.cpp:3936] Sending acknowledgement for > status update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for > task aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of > framework 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to > executor(1)@172.30.2.20:58660 > I0929 17:22:05.274534 26849 master.cpp:5661] Status update TASK_RUNNING > (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from agent > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-S0 at slave(77)@172.30.2.20:38955 > (ip-172-30-2-20.mesosphere.io) > ../../src/tests/health_check_tests.cpp:1398: Failure > I0929 17:22:05.274567 26849 master.cpp:5723] Forwarding status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > Value of: statusHealth.get().healthy() > Actual: false > Expected: true > I0929 17:22:05.274636 26849 master.cpp:7560] Updating the state of task > aa0792d3-8d85-4c32-bd04-56a9b552ebda of framework > 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- (latest state: TASK_RUNNING, status > update state: TASK_RUNNING) > I0929 17:22:05.274829 26844 sched.cpp:1025] Scheduler::statusUpdate took > 43297ns > Received SHUTDOWN event > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6293) HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros.
[ https://issues.apache.org/jira/browse/MESOS-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612004#comment-15612004 ] Alexander Rukletsov commented on MESOS-6293: https://reviews.apache.org/r/53226/ > HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros. > > > Key: MESOS-6293 > URL: https://issues.apache.org/jira/browse/MESOS-6293 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: haosdent > Labels: health-check, mesosphere > > I see consistent failures of this test in the internal CI in *some* distros, > specifically CentOS 6, Ubuntu 14, 15, 16. The source of the health check > failure is always the same: {{curl}} cannot connect to the target: > {noformat} > Received task health update, healthy: false > W0929 17:22:05.270992 2730 health_checker.cpp:204] Health check failed 1 > times consecutively: HTTP health check failed: curl returned exited with > status 7: curl: (7) couldn't connect to host > I0929 17:22:05.273634 26850 slave.cpp:3609] Handling status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from executor(1)@172.30.2.20:58660 > I0929 17:22:05.274178 26844 status_update_manager.cpp:323] Received status > update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274226 26844 status_update_manager.cpp:377] Forwarding update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to the agent > I0929 17:22:05.274314 26845 slave.cpp:4026] Forwarding the update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to master@172.30.2.20:38955 > I0929 17:22:05.274415 26845 slave.cpp:3920] Status update manager > successfully handled status update TASK_RUNNING (UUID: > f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > I0929 17:22:05.274436 26845 slave.cpp:3936] Sending acknowledgement for > status update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for > task aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of > framework 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to > executor(1)@172.30.2.20:58660 > I0929 17:22:05.274534 26849 master.cpp:5661] Status update TASK_RUNNING > (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from agent > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-S0 at slave(77)@172.30.2.20:38955 > (ip-172-30-2-20.mesosphere.io) > ../../src/tests/health_check_tests.cpp:1398: Failure > I0929 17:22:05.274567 26849 master.cpp:5723] Forwarding status update > TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task > aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- > Value of: statusHealth.get().healthy() > Actual: false > Expected: true > I0929 17:22:05.274636 26849 master.cpp:7560] Updating the state of task > 
aa0792d3-8d85-4c32-bd04-56a9b552ebda of framework > 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- (latest state: TASK_RUNNING, status > update state: TASK_RUNNING) > I0929 17:22:05.274829 26844 sched.cpp:1025] Scheduler::statusUpdate took > 43297ns > Received SHUTDOWN event > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
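As context for the failure above: curl exit status 7 ("couldn't connect to host") means the TCP connection itself was refused, i.e. nothing was listening yet, rather than the endpoint answering with a bad status. Purely as an illustration of that failure mode (and not a description of the actual fix in the review linked above), here is a minimal, self-contained sketch of a readiness wait that polls a port before anything asserts on health updates; the helper name {{waitForPort}}, the loopback address, the 5-second deadline and the 50ms poll interval are all invented for this sketch:
{code}
// Illustrative sketch only. It is not part of the Mesos test code and not the
// fix in the review above; `waitForPort`, the 5s deadline and the 50ms poll
// interval are invented for this example.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <chrono>
#include <cstdint>
#include <thread>

// Returns true once something on 127.0.0.1:`port` accepts a TCP connection,
// or false if `timeout` elapses first (the situation curl reports as (7)).
bool waitForPort(uint16_t port, std::chrono::milliseconds timeout)
{
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (std::chrono::steady_clock::now() < deadline) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
      return false;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    const bool connected =
      ::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
    ::close(fd);

    if (connected) {
      return true;
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }

  return false;
}

int main()
{
  // Poll the (hypothetical) health check target before asserting anything.
  return waitForPort(8080, std::chrono::seconds(5)) ? 0 : 1;
}
{code}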
[jira] [Updated] (MESOS-6279) Add test cases for the TCP health check.
[ https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6279: --- Summary: Add test cases for the TCP health check. (was: Add test cases for the TCP health check) > Add test cases for the TCP health check. > > > Key: MESOS-6279 > URL: https://issues.apache.org/jira/browse/MESOS-6279 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > Fix For: 1.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611795#comment-15611795 ] haosdent edited comment on MESOS-6446 at 10/27/16 1:41 PM: --- Yes, [~vinodkone] is reviewing patches. was (Author: haosd...@gmail.com): Yes, [~vinodkone] are reviewing patches. > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6484) Memory leak in `Future::after()`
[ https://issues.apache.org/jira/browse/MESOS-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rojas updated MESOS-6484: --- Description: The problem arises when one tries to associate an {{after()}} call to copied futures. The following test case is enough to reproduce the issue:
{code}
TEST(FutureTest, After3)
{
  auto policy = std::make_shared<int>(0);

  {
    auto generator = []() {
      return Future<Nothing>();
    };

    Future<Nothing> future = generator()
      .after(Milliseconds(1),
             [policy](const Future<Nothing>&) {
               return Nothing();
             });

    AWAIT_READY(future);
  }

  EXPECT_EQ(1, policy.use_count());
}
{code}
In the test, one would expect that there is only one active reference to {{policy}}, therefore the expectation {{EXPECT_EQ(1, policy.use_count())}}. However, if after is triggered more than once, each extra call adds one undeleted reference to {{policy}}.

was: The problem arises when one tries to associate an {{after()}} call to copied futures. The following test case is enough to reproduce the issue:
{code}
class Policy
{
public:
  virtual Try<Duration> timeout() = 0;
  virtual Duration totalTimeout() = 0;

  virtual ~Policy() {}
};

class MockPolicy : public Policy
{
public:
  virtual ~MockPolicy() {}

  MOCK_METHOD0(timeout, Try<Duration>());
  MOCK_METHOD0(totalTimeout, Duration());
};

template <typename T>
process::Future<T> retry(
    const std::function<process::Future<T>()>& action,
    const std::shared_ptr<Policy>& policy)
{
  CHECK(policy != nullptr);

  Try<Duration> timeout = policy->timeout();
  if (timeout.isError()) {
    return Future<T>::failed(timeout.error());
  }

  return action()
    .after(timeout.get(), [action, policy](const Future<T>&) {
      return retry(action, policy);
    });
}

TEST(FutureTest, Retry)
{
  auto policy = std::make_shared<MockPolicy>();
  EXPECT_CALL(*policy, timeout())
    .WillRepeatedly(Return(Milliseconds(1)));

  unsigned callCount = 0;

  auto future = retry([&callCount]() -> Future<Nothing> {
      ++callCount;
      if (callCount < 4) {
        return Future<Nothing>();
      }
      return Nothing();
    },
    policy);

  AWAIT_READY(future);
  EXPECT_EQ(1, policy.use_count());
{code}
In the test, one would expect that there is only one active reference to {{policy}}, therefore the expectation {{EXPECT_EQ(1, policy.use_count())}}. However, if after is triggered more than once, each extra call adds one undeleted reference to {{policy}}.
> Memory leak in `Future::after()`
> ---
>
> Key: MESOS-6484
> URL: https://issues.apache.org/jira/browse/MESOS-6484
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
>Affects Versions: 1.1.0
>Reporter: Alexander Rojas
> Labels: libprocess, mesosphere
>
> The problem arises when one tries to associate an {{after()}} call to copied
> futures. The following test case is enough to reproduce the issue:
> {code}
> TEST(FutureTest, After3)
> {
> auto policy = std::make_shared<int>(0);
> {
> auto generator = []() {
> return Future<Nothing>();
> };
> Future<Nothing> future = generator()
> .after(Milliseconds(1),
> [policy](const Future<Nothing>&) {
> return Nothing();
> });
> AWAIT_READY(future);
> }
> EXPECT_EQ(1, policy.use_count());
> }
> {code}
> In the test, one would expect that there is only one active reference to
> {{policy}}, therefore the expectation {{EXPECT_EQ(1, policy.use_count())}}.
> However, if after is triggered more than once, each extra call adds one
> undeleted reference to {{policy}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
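The expectation in the test above rests on ordinary {{std::shared_ptr}} reference counting: every stored copy of a lambda that captures {{policy}} bumps {{use_count()}}, and the count only falls back to 1 once those copies are destroyed. The following libprocess-free sketch shows just that invariant; the {{deferred}} vector is an invented stand-in for whatever internally retains the {{after()}} continuation and is not real libprocess code:
{code}
// Illustrative sketch only, no libprocess involved. The `deferred` vector is
// an invented stand-in for whatever internally retains the `after()`
// continuation.
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

int main()
{
  auto policy = std::make_shared<int>(0);

  // Stand-in for timer/continuation storage inside the library.
  std::vector<std::function<void()>> deferred;

  // Each stored copy of a lambda capturing `policy` bumps the count.
  deferred.push_back([policy]() { /* continuation body */ });
  deferred.push_back([policy]() { /* continuation body */ });
  assert(policy.use_count() == 3);  // local variable + two stored copies

  // Only once the stored copies are destroyed does the test's expectation
  // `EXPECT_EQ(1, policy.use_count())` hold.
  deferred.clear();
  assert(policy.use_count() == 1);

  return 0;
}
{code}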
[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611795#comment-15611795 ] haosdent commented on MESOS-6446: - Yes, [~vinodkone] are reviewing patches. > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks.
[ https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6278: --- Summary: Add test cases for the HTTP health checks. (was: Add test cases for the HTTP health checks) > Add test cases for the HTTP health checks. > -- > > Key: MESOS-6278 > URL: https://issues.apache.org/jira/browse/MESOS-6278 > Project: Mesos > Issue Type: Task > Components: tests >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere, test > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611654#comment-15611654 ] Till Toenshoff commented on MESOS-6446: --- [~vinodkone] are you shepherding this? > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6491) Mesos dashboard: Allow to download zip file of task sandbox
Mischa Krüger created MESOS-6491: Summary: Mesos dashboard: Allow to download zip file of task sandbox Key: MESOS-6491 URL: https://issues.apache.org/jira/browse/MESOS-6491 Project: Mesos Issue Type: Wish Components: webui Reporter: Mischa Krüger Priority: Minor The Mesos dashboard should have a little "Download sandbox as .zip" button or similar that allows downloading the complete sandbox with a single click. This would make sharing sandboxes much easier, as there would be no need to click on every file in the sandbox and download each one separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6484) Memory leak in `Future::after()`
[ https://issues.apache.org/jira/browse/MESOS-6484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611284#comment-15611284 ] Alexander Rojas commented on MESOS-6484: I've been looking into this for a couple of days now. I have narrowed it down to [this snippet|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/include/process/future.hpp#L1411-L1415]:
{code:title=future.hpp}
Timer timer = Clock::timer(
    duration, lambda::bind(&internal::expired<T>, f, latch, promise, *this));

onAny(lambda::bind(&internal::after<T>, latch, promise, timer, lambda::_1));
{code}
If the {{timer}} expires without the future being set, a copy of the {{timer}} is kept somewhere. However, if {{Clock::cancel(timer)}} is called (because the future is set), the timer is properly destroyed. One copy is kept for every expired timer. I just haven't found who owns that copy of the timer.
> Memory leak in `Future::after()`
> ---
>
> Key: MESOS-6484
> URL: https://issues.apache.org/jira/browse/MESOS-6484
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
>Affects Versions: 1.1.0
>Reporter: Alexander Rojas
> Labels: libprocess, mesosphere
>
> The problem arises when one tries to associate an {{after()}} call to copied
> futures. The following test case is enough to reproduce the issue:
> {code}
> class Policy
> {
> public:
> virtual Try<Duration> timeout() = 0;
> virtual Duration totalTimeout() = 0;
> virtual ~Policy() {}
> };
> class MockPolicy : public Policy
> {
> public:
> virtual ~MockPolicy() {}
> MOCK_METHOD0(timeout, Try<Duration>());
> MOCK_METHOD0(totalTimeout, Duration());
> };
> template <typename T>
> process::Future<T> retry(
> const std::function<process::Future<T>()>& action,
> const std::shared_ptr<Policy>& policy)
> {
> CHECK(policy != nullptr);
> Try<Duration> timeout = policy->timeout();
> if (timeout.isError()) {
> return Future<T>::failed(timeout.error());
> }
> return action()
> .after(timeout.get(), [action, policy](const Future<T>&) {
> return retry(action, policy);
> });
> }
> TEST(FutureTest, Retry)
> {
> auto policy = std::make_shared<MockPolicy>();
> EXPECT_CALL(*policy, timeout())
> .WillRepeatedly(Return(Milliseconds(1)));
> unsigned callCount = 0;
> auto future = retry([&callCount]() -> Future<Nothing> {
> ++callCount;
> if (callCount < 4) {
> return Future<Nothing>();
> }
> return Nothing();
> },
> policy);
> AWAIT_READY(future);
> EXPECT_EQ(1, policy.use_count());
> {code}
> In the test, one would expect that there is only one active reference to
> {{policy}}, therefore the expectation {{EXPECT_EQ(1, policy.use_count())}}.
> However, if after is triggered more than once, each extra call adds one
> undeleted reference to {{policy}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
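To make the suspected mechanism above concrete, here is a toy, self-contained model of a clock that keeps an expired timer's callback around unless it is cancelled; any {{shared_ptr}} bound into that callback then stays alive, matching the observation that the extra reference disappears when {{Clock::cancel(timer)}} runs. {{ToyClock}}, {{expireAll()}} and {{cancelAll()}} are invented names; this is not the libprocess Clock/Timer implementation:
{code}
// Illustrative toy model only; `ToyClock`, `expireAll()` and `cancelAll()`
// are invented and this is not the libprocess Clock/Timer implementation.
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

class ToyClock
{
public:
  // Registers a callback, the analogue of Clock::timer().
  void timer(std::function<void()> callback)
  {
    pending_.push_back(std::move(callback));
  }

  // "Expires" all pending timers but keeps the fired callbacks around --
  // the kind of retention suspected above.
  void expireAll()
  {
    for (std::function<void()>& cb : pending_) {
      cb();
      fired_.push_back(std::move(cb));
    }
    pending_.clear();
  }

  // Analogue of Clock::cancel(): actually drops the callbacks.
  void cancelAll()
  {
    pending_.clear();
    fired_.clear();
  }

private:
  std::vector<std::function<void()>> pending_;
  std::vector<std::function<void()>> fired_;
};

int main()
{
  auto policy = std::make_shared<int>(0);
  ToyClock clock;

  clock.timer([policy]() { /* after() continuation */ });
  clock.expireAll();

  // The expired callback is still owned by the clock, so `policy` is not
  // released: this mirrors one leaked reference per expired timer.
  assert(policy.use_count() == 2);

  clock.cancelAll();
  assert(policy.use_count() == 1);  // Cancellation releases the reference.

  return 0;
}
{code}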
[jira] [Commented] (MESOS-6489) Better support for containers that want to manage their own cgroup.
[ https://issues.apache.org/jira/browse/MESOS-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611011#comment-15611011 ] Anindya Sinha commented on MESOS-6489: -- Jotting down some thoughts based on our previous conversation. In {{Future<Nothing> destroy(const string& hierarchy, const string& cgroup)}}:
1. Extract all cgroups (including sub-cgroups) in bottom-up fashion via {{cgroups::get(hierarchy, cgroup)}}.
2. If freezer is available:
2a. We use {{TasksKiller}} to freeze cgroups, {{SIGKILL}} all tasks, and thaw cgroups (maybe in top-down fashion). However, we add a new attribute to this class, {{bool ignoreMissingCgroup}}. If that is set, we ignore any error in {{TasksKiller::finished()}} for cgroups that do not exist.
2b. At this point, we remove the cgroups in bottom-up fashion in case there is no error reported in {{TasksKiller}}. We bail out with an error if there is any failure in removal of cgroups. Similar to step #2a, we ignore errors for cgroups that do not exist.
3. If freezer is unavailable, we remove the cgroups starting from the bottom up using {{cgroups::remove(hierarchy, cgroup)}}. If remove fails due to non-presence of the cgroup, we ignore that failure.
We will have the "ignore error due to missing cgroup" logic in two places, viz. {{TasksKiller::finished()}} and {{cgroups::destroy}}.
> Better support for containers that want to manage their own cgroup.
> ---
>
> Key: MESOS-6489
> URL: https://issues.apache.org/jira/browse/MESOS-6489
> Project: Mesos
> Issue Type: Improvement
>Reporter: Jie Yu
>
> Some containers want to manage their cgroup by sub-dividing the cgroup that
> Mesos allocates to them into multiple sub-cgroups and put subprocess into the
> corresponding sub-cgroups.
> For instance, someone wants to run Docker daemon in a Mesos container. Docker
> daemon will manage the cgroup assigned to it by Mesos (with the help , for
> example, cgroups namespace).
> Problems arise during the teardown of the container because two entities
> might be manipulating the same cgroup simultaneously. For example, the Mesos
> cgroups::destroy might fail if the task running inside is trying to delete
> the same nested cgroup at the same time.
> To support that case, we should consider kill all the processes in the Mesos
> cgroup first, making sure that no one will be creating sub-cgroups and moving
> new processes into sub-cgroups. And then, destroy the cgroups recursively.
> And we need freezer because we want to make sure all processes are stopped
> while we are sending kill signals to avoid TOCTTOU race problem. I think it
> makes more sense to freezer the cgroups (and sub-cgroups) from top down
> (rather than bottom up because typically, processes in the parent cgroup
> manipulate sub-cgroups). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
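A rough, synchronous sketch of the flow proposed in the comment above (kill first when freezer is available, then remove cgroups bottom-up, tolerating cgroups that have already disappeared). The names here ({{Result}}, {{destroySketch}}, the {{killTasks}} and {{removeCgroup}} callbacks) are invented stand-ins for {{TasksKiller}} and {{cgroups::remove}}, not the actual Mesos implementation; the point is only to show where the two "ignore missing cgroup" checks would sit:
{code}
// Illustrative sketch only. `Result`, `destroySketch`, `killTasks` and
// `removeCgroup` are invented stand-ins for TasksKiller and cgroups::remove;
// this is not the Mesos implementation.
#include <functional>
#include <string>
#include <vector>

enum class Result { Ok, Missing, Error };

// `subcgroups` is assumed to be ordered bottom-up (deepest cgroup first),
// the order the comment derives from cgroups::get().
Result destroySketch(
    const std::vector<std::string>& subcgroups,
    bool freezerAvailable,
    const std::function<Result(const std::string&)>& killTasks,
    const std::function<Result(const std::string&)>& removeCgroup)
{
  if (freezerAvailable) {
    // Step 2a: freeze, SIGKILL, thaw per cgroup; a missing cgroup is
    // tolerated (the proposed `ignoreMissingCgroup`).
    for (const std::string& cgroup : subcgroups) {
      Result r = killTasks(cgroup);
      if (r == Result::Missing) {
        continue;
      }
      if (r == Result::Error) {
        return r;
      }
    }
  }

  // Steps 2b / 3: remove bottom-up; "already gone" is tolerated because a
  // task managing its own sub-cgroups may have removed them first.
  for (const std::string& cgroup : subcgroups) {
    Result r = removeCgroup(cgroup);
    if (r == Result::Error) {
      return r;
    }
  }

  return Result::Ok;
}

int main()
{
  const std::vector<std::string> cgroups = {"mesos/abc/docker", "mesos/abc"};

  Result r = destroySketch(
      cgroups,
      true,
      [](const std::string&) { return Result::Ok; },        // kill succeeded
      [](const std::string&) { return Result::Missing; });  // already removed

  return r == Result::Error ? 1 : 0;
}
{code}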