[jira] [Created] (MESOS-4773) CMake: Build master executable.
Diana Arroyo created MESOS-4773: --- Summary: CMake: Build master executable. Key: MESOS-4773 URL: https://issues.apache.org/jira/browse/MESOS-4773 Project: Mesos Issue Type: Task Components: cmake Reporter: Diana Arroyo -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4754) The "executors" field is exposed under a backwards incompatible schema.
[ https://issues.apache.org/jira/browse/MESOS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166848#comment-15166848 ] Michael Park commented on MESOS-4754: - {noformat} commit d99c778de22954c0b3f7089be45ef250386fccd1 Author: Michael Park Date: Wed Feb 24 22:35:39 2016 -0800 Added missing `json` declaration for `ExecutorInfo`. Review: https://reviews.apache.org/r/43937/ {noformat} > The "executors" field is exposed under a backwards incompatible schema. > --- > > Key: MESOS-4754 > URL: https://issues.apache.org/jira/browse/MESOS-4754 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Michael Park >Assignee: Michael Park > Labels: mesosphere > Fix For: 0.27.2 > > > In 0.26.0, the master's {{/state}} endpoint generated the following: > {code} > { > /* ... */ > "frameworks": [ > { > /* ... */ > "executors": [ > { > "command": { > "argv": [], > "uris": [], > "value": > "/Users/mpark/Projects/mesos/build/opt/src/long-lived-executor" > }, > "executor_id": "default", > "framework_id": "0ea528a9-64ba-417f-98ea-9c4b8d418db6-", > "name": "Long Lived Executor (C++)", > "resources": { > "cpus": 0, > "disk": 0, > "mem": 0 > }, > "slave_id": "8a513678-03a1-4cb5-9279-c3c0c591f1d8-S0" > } > ], > /* ... */ > } > ] > /* ... */ > } > {code} > In 0.27.1, the {{ExecutorInfo}} is mistakenly exposed in the raw protobuf > schema: > {code} > { > /* ... */ > "frameworks": [ > { > /* ... */ > "executors": [ > { > "command": { > "shell": true, > "value": > "/Users/mpark/Projects/mesos/build/opt/src/long-lived-executor" > }, > "executor_id": { > "value": "default" > }, > "framework_id": { > "value": "368a5a49-480b-41f6-a13b-24a69c92a72e-" > }, > "name": "Long Lived Executor (C++)", > "slave_id": "8a513678-03a1-4cb5-9279-c3c0c591f1d8-S0", > "source": "cpp_long_lived_framework" > } ], > /* ... */ > } > ] > /* ... */ > } > {code} > This is a backwards incompatible API change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
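The schema regression above can break any client that parses {{/state}}. A defensive client can accept both shapes; here is a minimal sketch (the helper name is hypothetical, not part of Mesos):

```python
def executor_ids(state):
    """Collect executor IDs from a master /state response, tolerating both
    the 0.26.x schema ("executor_id" is a plain JSON string) and the
    0.27.1 regression ("executor_id" is a raw protobuf object with a
    "value" key)."""
    ids = []
    for framework in state.get("frameworks", []):
        for executor in framework.get("executors", []):
            eid = executor["executor_id"]
            # The 0.27.1 raw-protobuf form nests the ID under "value".
            ids.append(eid["value"] if isinstance(eid, dict) else eid)
    return ids

# Reduced versions of the two responses quoted above:
old_state = {"frameworks": [{"executors": [{"executor_id": "default"}]}]}
new_state = {"frameworks": [{"executors": [{"executor_id": {"value": "default"}}]}]}
```

Both shapes then reduce to the same executor ID list.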
[jira] [Updated] (MESOS-4772) TaskInfo/ExecutorInfo should include owner information
[ https://issues.apache.org/jira/browse/MESOS-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-4772: -- Labels: authorization mesosphere ownership security (was: authorization ownership security) > TaskInfo/ExecutorInfo should include owner information > -- > > Key: MESOS-4772 > URL: https://issues.apache.org/jira/browse/MESOS-4772 > Project: Mesos > Issue Type: Improvement > Components: security >Reporter: Adam B > Labels: authorization, mesosphere, ownership, security > > We need a way to assign fine-grained ownership to tasks/executors so that > multi-user frameworks can tell Mesos to associate the task with a user > identity (rather than just the framework principal+role). Then, when an HTTP > user requests to view the task's sandbox contents, or kill the task, or list > all tasks, the authorizer can determine whether to allow/deny/filter the > request based on finer-grained, user-level ownership. > Some systems may want TaskInfo.owner to represent a group rather than an > individual user. That's fine as long as the framework sets the field to the > group ID in such a way that a group-aware authorizer can interpret it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4772) TaskInfo/ExecutorInfo should include owner information
Adam B created MESOS-4772: - Summary: TaskInfo/ExecutorInfo should include owner information Key: MESOS-4772 URL: https://issues.apache.org/jira/browse/MESOS-4772 Project: Mesos Issue Type: Improvement Components: security Reporter: Adam B We need a way to assign fine-grained ownership to tasks/executors so that multi-user frameworks can tell Mesos to associate the task with a user identity (rather than just the framework principal+role). Then, when an HTTP user requests to view the task's sandbox contents, or kill the task, or list all tasks, the authorizer can determine whether to allow/deny/filter the request based on finer-grained, user-level ownership. Some systems may want TaskInfo.owner to represent a group rather than an individual user. That's fine as long as the framework sets the field to the group ID in such a way that a group-aware authorizer can interpret it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
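To make the intent concrete, here is a toy sketch of how an authorizer might filter a task list once a (proposed, not yet existing) {{owner}} field is available on tasks; all names here are hypothetical illustrations, not Mesos APIs:

```python
def visible_tasks(tasks, requester, requester_groups=()):
    """Filter tasks down to those the requesting HTTP user may see.
    A task is visible when its (hypothetical) "owner" field matches the
    requester directly, or names a group the requester belongs to, which
    is how a group-aware authorizer could interpret the field."""
    allowed = {requester, *requester_groups}
    return [t for t in tasks if t.get("owner") in allowed]

tasks = [
    {"task_id": "t1", "owner": "alice"},      # individual ownership
    {"task_id": "t2", "owner": "dev-team"},   # group ownership
    {"task_id": "t3", "owner": "bob"},
]
```

For example, "alice" in group "dev-team" would see t1 and t2 but not t3.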
[jira] [Commented] (MESOS-4676) ROOT_DOCKER_Logs is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166618#comment-15166618 ] haosdent commented on MESOS-4676: - I also found this issue: https://github.com/docker/docker/issues/19950 It says: {quote} This error is from stdcopy package which muxes stdout/stderr streams. It seems like now it writes something weird; I think it can also be golang version change. {quote} I could reproduce it through the example code in the issue: https://gist.github.com/dpiddy/0c460a8bb297ee19a7a0 and verified that adding {{-t}} to {{docker run}} also avoids this problem. > ROOT_DOCKER_Logs is flaky. > -- > > Key: MESOS-4676 > URL: https://issues.apache.org/jira/browse/MESOS-4676 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27 > Environment: CentOS 7 with SSL. >Reporter: Bernd Mathiske >Assignee: Joseph Wu > Labels: flaky, mesosphere, test > > {noformat} > [18:06:25][Step 8/8] [ RUN ] DockerContainerizerTest.ROOT_DOCKER_Logs > [18:06:25][Step 8/8] I0215 17:06:25.256103 1740 leveldb.cpp:174] Opened db > in 6.548327ms > [18:06:25][Step 8/8] I0215 17:06:25.258002 1740 leveldb.cpp:181] Compacted > db in 1.837816ms > [18:06:25][Step 8/8] I0215 17:06:25.258059 1740 leveldb.cpp:196] Created db > iterator in 22044ns > [18:06:25][Step 8/8] I0215 17:06:25.258076 1740 leveldb.cpp:202] Seeked to > beginning of db in 2347ns > [18:06:25][Step 8/8] I0215 17:06:25.258091 1740 leveldb.cpp:271] Iterated > through 0 keys in the db in 571ns > [18:06:25][Step 8/8] I0215 17:06:25.258152 1740 replica.cpp:779] Replica > recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [18:06:25][Step 8/8] I0215 17:06:25.258936 1758 recover.cpp:447] Starting > replica recovery > [18:06:25][Step 8/8] I0215 17:06:25.259177 1758 recover.cpp:473] Replica is > in EMPTY status > [18:06:25][Step 8/8] I0215 17:06:25.260327 1757 replica.cpp:673] Replica in > EMPTY status received a broadcasted recover request from >
(13608)@172.30.2.239:39785 > [18:06:25][Step 8/8] I0215 17:06:25.260545 1758 recover.cpp:193] Received a > recover response from a replica in EMPTY status > [18:06:25][Step 8/8] I0215 17:06:25.261065 1757 master.cpp:376] Master > 112363e2-c680-4946-8fee-d0626ed8b21e (ip-172-30-2-239.mesosphere.io) started > on 172.30.2.239:39785 > [18:06:25][Step 8/8] I0215 17:06:25.261209 1761 recover.cpp:564] Updating > replica status to STARTING > [18:06:25][Step 8/8] I0215 17:06:25.261086 1757 master.cpp:378] Flags at > startup: --acls="" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate="true" > --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/HncLLj/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/HncLLj/master" > --zk_session_timeout="10secs" > [18:06:25][Step 8/8] I0215 17:06:25.261446 1757 master.cpp:423] Master only > allowing authenticated frameworks to register > [18:06:25][Step 8/8] I0215 17:06:25.261456 1757 master.cpp:428] Master only > allowing authenticated slaves to register > [18:06:25][Step 8/8] I0215 17:06:25.261462 1757 credentials.hpp:35] Loading > credentials for authentication from '/tmp/HncLLj/credentials' > [18:06:25][Step 8/8] I0215 17:06:25.261723 1757 master.cpp:468] Using > default 'crammd5' authenticator 
> [18:06:25][Step 8/8] I0215 17:06:25.261855 1757 master.cpp:537] Using > default 'basic' HTTP authenticator > [18:06:25][Step 8/8] I0215 17:06:25.262022 1757 master.cpp:571] > Authorization enabled > [18:06:25][Step 8/8] I0215 17:06:25.262177 1755 hierarchical.cpp:144] > Initialized hierarchical allocator process > [18:06:25][Step 8/8] I0215 17:06:25.262177 1758 whitelist_watcher.cpp:77] No > whitelist given > [18:06:25][Step 8/8] I0215 17:06:25.262899 1760 leveldb.cpp:304] Persisting > metadata (8 bytes) to leveldb took 1.517992ms > [18:06:25][Step 8/8] I0215 17:06:25.262924 1760 replica.cpp:320] Persisted > replica status to STARTING > [18:06:25][Step 8/8] I0215 17:06:25.263144 1754 recover.cpp:473] Replica is > in STARTING status > [18:06:25][Step 8/8]
[jira] [Updated] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.
[ https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma updated MESOS-3078: Assignee: (was: Klaus Ma) > Recovered resources are not re-allocated until the next allocation delay. > - > > Key: MESOS-3078 > URL: https://issues.apache.org/jira/browse/MESOS-3078 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler > > Currently, when resources are recovered, we do not perform an allocation for > that slave. Rather, we wait until the next allocation interval. > For small task, high throughput frameworks, this can have a significant > impact on overall throughput, see the following thread: > http://markmail.org/thread/y6mzfwzlurv6nik3 > We should consider immediately performing a re-allocation for the slave upon > resource recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4677) LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166531#comment-15166531 ] haosdent commented on MESOS-4677: - Nice analysis! {quote} cgroups.procs doesn't change since exec doesn't change the PID. But there may be a race between updating the "threads" (cgroup/tasks) and us reading the cgroup/tasks. {quote} I had thought the cgroup/tasks value here was always the same as cgroup/cgroup.procs, because we only have "cat". According to your analysis, cgroup/tasks would also change here, right? > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids is flaky. > --- > > Key: MESOS-4677 > URL: https://issues.apache.org/jira/browse/MESOS-4677 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.27 >Reporter: Bernd Mathiske >Assignee: Joseph Wu > Labels: flaky, test > > This test fails very often when run on CentOS 7, but may also fail elsewhere > sometimes. Unfortunately, it tends to only fail when --verbose is not set. > The output is this: > {noformat} > [21:45:21][Step 8/8] [ RUN ] > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids > [21:45:21][Step 8/8] ../../src/tests/containerizer/isolator_tests.cpp:807: > Failure > [21:45:21][Step 8/8] Value of: usage.get().threads() > [21:45:21][Step 8/8] Actual: 0 > [21:45:21][Step 8/8] Expected: 1U > [21:45:21][Step 8/8] Which is: 1 > [21:45:21][Step 8/8] [ FAILED ] > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids (94 ms) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4771) Document the network/cni isolator.
[ https://issues.apache.org/jira/browse/MESOS-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-4771: - Assignee: Qian Zhang > Document the network/cni isolator. > -- > > Key: MESOS-4771 > URL: https://issues.apache.org/jira/browse/MESOS-4771 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Qian Zhang > > We need to document this isolator in mesos-containerizer.md (e.g., how to > configure it, what's the pre-requisite, etc.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4768) MasterMaintenanceTest.InverseOffers is flaky
[ https://issues.apache.org/jira/browse/MESOS-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-4768: - Shepherd: Joris Van Remoortere Assignee: Joseph Wu Sprint: Mesosphere Sprint 29 > MasterMaintenanceTest.InverseOffers is flaky > > > Key: MESOS-4768 > URL: https://issues.apache.org/jira/browse/MESOS-4768 > Project: Mesos > Issue Type: Bug > Components: tests >Affects Versions: 0.28.0 >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: mesosphere, test > > [MESOS-4169] significantly sped up this test, but also surfaced some more > flakiness. This can be fixed in the same way as [MESOS-4059]. > Verbose logs from ASF Centos7 build: > {code} > [ RUN ] MasterMaintenanceTest.InverseOffers > I0224 22:35:53.714018 1948 leveldb.cpp:174] Opened db in 2.034387ms > I0224 22:35:53.714663 1948 leveldb.cpp:181] Compacted db in 608839ns > I0224 22:35:53.714709 1948 leveldb.cpp:196] Created db iterator in 19043ns > I0224 22:35:53.714844 1948 leveldb.cpp:202] Seeked to beginning of db in > 2330ns > I0224 22:35:53.714956 1948 leveldb.cpp:271] Iterated through 0 keys in the > db in 518ns > I0224 22:35:53.715092 1948 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0224 22:35:53.715646 1968 recover.cpp:447] Starting replica recovery > I0224 22:35:53.715915 1981 recover.cpp:473] Replica is in EMPTY status > I0224 22:35:53.717067 1972 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (4533)@172.17.0.1:36678 > I0224 22:35:53.717445 1981 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0224 22:35:53.717888 1978 recover.cpp:564] Updating replica status to > STARTING > I0224 22:35:53.718585 1979 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 525061ns > I0224 22:35:53.718618 1979 replica.cpp:320] Persisted replica status to > STARTING > I0224 22:35:53.718827 1982 recover.cpp:473] Replica is in STARTING status > 
I0224 22:35:53.719728 1969 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (4534)@172.17.0.1:36678 > I0224 22:35:53.719974 1971 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0224 22:35:53.720369 1970 recover.cpp:564] Updating replica status to VOTING > I0224 22:35:53.720789 1982 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 322308ns > I0224 22:35:53.720823 1982 replica.cpp:320] Persisted replica status to > VOTING > I0224 22:35:53.720968 1982 recover.cpp:578] Successfully joined the Paxos > group > I0224 22:35:53.721101 1982 recover.cpp:462] Recover process terminated > I0224 22:35:53.721698 1982 master.cpp:376] Master > aab18b61-7811-4c43-a672-d1a63818c880 (4db5fa128d2d) started on > 172.17.0.1:36678 > I0224 22:35:53.721719 1982 master.cpp:378] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_http="true" > --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/MjbcWP/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.28.0/_inst/share/mesos/webui" > --work_dir="/tmp/MjbcWP/master" --zk_session_timeout="10secs" > I0224 22:35:53.722039 1982 master.cpp:425] Master allowing unauthenticated > frameworks to register > I0224 22:35:53.722053 
1982 master.cpp:428] Master only allowing > authenticated slaves to register > I0224 22:35:53.722061 1982 credentials.hpp:35] Loading credentials for > authentication from '/tmp/MjbcWP/credentials' > I0224 22:35:53.722394 1982 master.cpp:468] Using default 'crammd5' > authenticator > I0224 22:35:53.722525 1982 master.cpp:537] Using default 'basic' HTTP > authenticator > I0224 22:35:53.722661 1982 master.cpp:571] Authorization enabled > I0224 22:35:53.722813 1968 hierarchical.cpp:144] Initialized hierarchical > allocator process > I0224 22:35:53.722846 1980 whitelist_watcher.cpp:77] No whitelist given > I0224 22:35:53.724957 1977 master.cpp:1712] The newly elected leader is > master@172.17.0.1:36678 with id
[jira] [Created] (MESOS-4771) Document the network/cni isolator.
Jie Yu created MESOS-4771: - Summary: Document the network/cni isolator. Key: MESOS-4771 URL: https://issues.apache.org/jira/browse/MESOS-4771 Project: Mesos Issue Type: Task Reporter: Jie Yu We need to document this isolator in mesos-containerizer.md (e.g., how to configure it, what's the pre-requisite, etc.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4764) The network/cni isolator should report assigned IP address.
[ https://issues.apache.org/jira/browse/MESOS-4764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang reassigned MESOS-4764: - Assignee: Qian Zhang > The network/cni isolator should report assigned IP address. > > > Key: MESOS-4764 > URL: https://issues.apache.org/jira/browse/MESOS-4764 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Qian Zhang > > In order for service discovery to work in some cases, the network/cni > isolator needs to report the assigned IP address through the > isolator->status() interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3078) Recovered resources are not re-allocated until the next allocation delay.
[ https://issues.apache.org/jira/browse/MESOS-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-3078: --- Assignee: Klaus Ma > Recovered resources are not re-allocated until the next allocation delay. > - > > Key: MESOS-3078 > URL: https://issues.apache.org/jira/browse/MESOS-3078 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler >Assignee: Klaus Ma > > Currently, when resources are recovered, we do not perform an allocation for > that slave. Rather, we wait until the next allocation interval. > For small task, high throughput frameworks, this can have a significant > impact on overall throughput, see the following thread: > http://markmail.org/thread/y6mzfwzlurv6nik3 > We should consider immediately performing a re-allocation for the slave upon > resource recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
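The proposal boils down to triggering an allocation for the affected slave immediately on resource recovery, instead of waiting out the batch interval. A toy model of the two behaviors (all names are hypothetical illustrations, not actual allocator code):

```python
class ToyAllocator:
    """Toy model of the ticket's proposal: recovered resources go back
    into a per-slave pool. With allocate_on_recovery=False the pool is
    only offered at the next batch allocation (--allocation_interval);
    with True, recovering resources triggers an immediate allocation for
    that one slave, cutting offer latency for small, short-lived tasks."""

    def __init__(self, allocate_on_recovery):
        self.allocate_on_recovery = allocate_on_recovery
        self.available = {}  # slave_id -> recovered cpus
        self.offers = []     # (slave_id, cpus) offers sent to frameworks

    def recover_resources(self, slave_id, cpus):
        self.available[slave_id] = self.available.get(slave_id, 0) + cpus
        if self.allocate_on_recovery:
            self._allocate(slave_id)  # proposed: do not wait for the batch

    def batch_allocate(self):
        # Runs once per allocation interval (current behavior).
        for slave_id in list(self.available):
            self._allocate(slave_id)

    def _allocate(self, slave_id):
        cpus = self.available.pop(slave_id, 0)
        if cpus:
            self.offers.append((slave_id, cpus))
```

In the eager variant an offer goes out as soon as a task finishes; in the current variant it waits up to one full allocation interval.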
[jira] [Created] (MESOS-4770) Investigate performance improvements for 'Resources' class.
Benjamin Mahler created MESOS-4770: -- Summary: Investigate performance improvements for 'Resources' class. Key: MESOS-4770 URL: https://issues.apache.org/jira/browse/MESOS-4770 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Priority: Critical Currently we have some performance issues when we have heavy usage of the {{Resources}} class, and we tend to work around these issues (e.g. by reducing the number of Resources arithmetic operations in the caller code). The implementation of {{Resources}} currently consists of wrapping underlying {{Resource}} protobuf objects and manipulating them. This is fairly expensive compared to doing things more directly with C++ objects. This ticket is to explore the performance improvements of using C++ objects more directly instead of working off of {{Resource}} objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
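As a rough illustration of the direction the ticket suggests (keeping quantities in plain native containers rather than round-tripping every operation through {{Resource}} protobuf messages), here is a hypothetical sketch; it is not Mesos code:

```python
class ScalarResources:
    """Illustrative only: scalar resource quantities kept in a plain
    dict, so addition and subtraction are simple native arithmetic with
    no protobuf object construction or parsing on the hot path."""

    def __init__(self, **scalars):
        self._scalars = dict(scalars)

    def __add__(self, other):
        merged = dict(self._scalars)
        for name, qty in other._scalars.items():
            merged[name] = merged.get(name, 0) + qty
        return ScalarResources(**merged)

    def __sub__(self, other):
        merged = dict(self._scalars)
        for name, qty in other._scalars.items():
            merged[name] = merged.get(name, 0) - qty
        return ScalarResources(**merged)

    def get(self, name):
        return self._scalars.get(name, 0)
```

The trade-off is that non-scalar metadata (roles, reservations, disk info) still needs the full {{Resource}} representation; a flattened form like this only helps the arithmetic-heavy paths.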
[jira] [Updated] (MESOS-4769) Update state endpoints to allow clients to determine how many resources for a given role have been used
[ https://issues.apache.org/jira/browse/MESOS-4769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated MESOS-4769: --- Labels: mesosphere (was: ) > Update state endpoints to allow clients to determine how many resources for a > given role have been used > --- > > Key: MESOS-4769 > URL: https://issues.apache.org/jira/browse/MESOS-4769 > Project: Mesos > Issue Type: Task >Affects Versions: 0.27.1 >Reporter: Michael Gummelt > Labels: mesosphere > > AFAICT, this is currently impossible. Say I have a cluster with 4CPUs > reserved for {{spark}} and 4CPUs unreserved, I have a framework registered as > {{spark}}, and I would like to determine how many CPUs reserved for {{Spark}} > have been used. AFAIK, there are two endpoints with interesting information: > {{/master/state}} and {{/master/roles}}. Both endpoints tell me how many > resources are used by the framework registered as {{spark}}, but it doesn't > tell me which role those resources belong to (i.e. are they reserved or > unreserved). > A simple fix would be to update {{/master/roles}} to split out resources into > "reserved" and "unreserved". However, this will fail to solve the problem if > (and hopefully when) Mesos supports multi-role frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4769) Update state endpoints to allow clients to determine how many resources for a given role have been used
Michael Gummelt created MESOS-4769: -- Summary: Update state endpoints to allow clients to determine how many resources for a given role have been used Key: MESOS-4769 URL: https://issues.apache.org/jira/browse/MESOS-4769 Project: Mesos Issue Type: Task Affects Versions: 0.27.1 Reporter: Michael Gummelt AFAICT, this is currently impossible. Say I have a cluster with 4CPUs reserved for {{spark}} and 4CPUs unreserved, I have a framework registered as {{spark}}, and I would like to determine how many CPUs reserved for {{Spark}} have been used. AFAIK, there are two endpoints with interesting information: {{/master/state}} and {{/master/roles}}. Both endpoints tell me how many resources are used by the framework registered as {{spark}}, but it doesn't tell me which role those resources belong to (i.e. are they reserved or unreserved). A simple fix would be to update {{/master/roles}} to split out resources into "reserved" and "unreserved". However, this will fail to solve the problem if (and hopefully when) Mesos supports multi-role frameworks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
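Under the proposed fix, a client could compute per-role usage directly. The sketch below assumes a hypothetical response shape in which {{/master/roles}} splits each role's resources into "reserved" and "unreserved" buckets; neither the shape nor the helper exists today:

```python
def used_reserved_cpus(roles_response, role):
    """Given the (hypothetical, proposed) reserved/unreserved split of a
    /master/roles response, return how many reserved CPUs a role has used."""
    for entry in roles_response.get("roles", []):
        if entry.get("name") == role:
            return entry.get("used", {}).get("reserved", {}).get("cpus", 0)
    return 0

# Hypothetical response for the scenario in the ticket: 4 CPUs reserved
# for "spark", 4 unreserved, with 3 reserved CPUs currently in use.
roles = {"roles": [{
    "name": "spark",
    "resources": {"reserved": {"cpus": 4}, "unreserved": {"cpus": 4}},
    "used": {"reserved": {"cpus": 3}, "unreserved": {"cpus": 1}},
}]}
```

As the ticket notes, even this split would need rethinking if frameworks can one day register with multiple roles.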
[jira] [Created] (MESOS-4768) MasterMaintenanceTest.InverseOffers is flaky
Joseph Wu created MESOS-4768: Summary: MasterMaintenanceTest.InverseOffers is flaky Key: MESOS-4768 URL: https://issues.apache.org/jira/browse/MESOS-4768 Project: Mesos Issue Type: Bug Components: tests Affects Versions: 0.28.0 Reporter: Joseph Wu [MESOS-4169] significantly sped up this test, but also surfaced some more flakiness. This can be fixed in the same way as [MESOS-4059]. Verbose logs from ASF Centos7 build: {code} [ RUN ] MasterMaintenanceTest.InverseOffers I0224 22:35:53.714018 1948 leveldb.cpp:174] Opened db in 2.034387ms I0224 22:35:53.714663 1948 leveldb.cpp:181] Compacted db in 608839ns I0224 22:35:53.714709 1948 leveldb.cpp:196] Created db iterator in 19043ns I0224 22:35:53.714844 1948 leveldb.cpp:202] Seeked to beginning of db in 2330ns I0224 22:35:53.714956 1948 leveldb.cpp:271] Iterated through 0 keys in the db in 518ns I0224 22:35:53.715092 1948 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0224 22:35:53.715646 1968 recover.cpp:447] Starting replica recovery I0224 22:35:53.715915 1981 recover.cpp:473] Replica is in EMPTY status I0224 22:35:53.717067 1972 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (4533)@172.17.0.1:36678 I0224 22:35:53.717445 1981 recover.cpp:193] Received a recover response from a replica in EMPTY status I0224 22:35:53.717888 1978 recover.cpp:564] Updating replica status to STARTING I0224 22:35:53.718585 1979 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 525061ns I0224 22:35:53.718618 1979 replica.cpp:320] Persisted replica status to STARTING I0224 22:35:53.718827 1982 recover.cpp:473] Replica is in STARTING status I0224 22:35:53.719728 1969 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (4534)@172.17.0.1:36678 I0224 22:35:53.719974 1971 recover.cpp:193] Received a recover response from a replica in STARTING status I0224 22:35:53.720369 1970 recover.cpp:564] Updating replica status to 
VOTING I0224 22:35:53.720789 1982 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 322308ns I0224 22:35:53.720823 1982 replica.cpp:320] Persisted replica status to VOTING I0224 22:35:53.720968 1982 recover.cpp:578] Successfully joined the Paxos group I0224 22:35:53.721101 1982 recover.cpp:462] Recover process terminated I0224 22:35:53.721698 1982 master.cpp:376] Master aab18b61-7811-4c43-a672-d1a63818c880 (4db5fa128d2d) started on 172.17.0.1:36678 I0224 22:35:53.721719 1982 master.cpp:378] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_http="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/MjbcWP/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/mesos/mesos-0.28.0/_inst/share/mesos/webui" --work_dir="/tmp/MjbcWP/master" --zk_session_timeout="10secs" I0224 22:35:53.722039 1982 master.cpp:425] Master allowing unauthenticated frameworks to register I0224 22:35:53.722053 1982 master.cpp:428] Master only allowing authenticated slaves to register I0224 22:35:53.722061 1982 credentials.hpp:35] Loading credentials for authentication from '/tmp/MjbcWP/credentials' I0224 22:35:53.722394 1982 master.cpp:468] Using default 'crammd5' authenticator I0224 22:35:53.722525 1982 master.cpp:537] Using default 'basic' HTTP authenticator I0224 22:35:53.722661 1982 
master.cpp:571] Authorization enabled I0224 22:35:53.722813 1968 hierarchical.cpp:144] Initialized hierarchical allocator process I0224 22:35:53.722846 1980 whitelist_watcher.cpp:77] No whitelist given I0224 22:35:53.724957 1977 master.cpp:1712] The newly elected leader is master@172.17.0.1:36678 with id aab18b61-7811-4c43-a672-d1a63818c880 I0224 22:35:53.725000 1977 master.cpp:1725] Elected as the leading master! I0224 22:35:53.725023 1977 master.cpp:1470] Recovering from registrar I0224 22:35:53.725306 1967 registrar.cpp:307] Recovering registrar I0224 22:35:53.725808 1977 log.cpp:659] Attempting to start the writer I0224 22:35:53.727145 1973 replica.cpp:493] Replica received implicit promise request from (4536)@172.17.0.1:36678 with proposal 1 I0224 22:35:53.727728 1973
[jira] [Updated] (MESOS-4573) Design doc for scheduler HTTP Stream IDs
[ https://issues.apache.org/jira/browse/MESOS-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4573: - Description: This ticket is for the design of HTTP stream IDs, for use with HTTP schedulers. These IDs allow Mesos to distinguish between different instances of HTTP framework schedulers. (was: This ticket is for the design of an HTTP session protocol for use with HTTP schedulers.) > Design doc for scheduler HTTP Stream IDs > > > Key: MESOS-4573 > URL: https://issues.apache.org/jira/browse/MESOS-4573 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Greg Mann >Assignee: Greg Mann > Labels: http, mesosphere > > This ticket is for the design of HTTP stream IDs, for use with HTTP > schedulers. These IDs allow Mesos to distinguish between different instances > of HTTP framework schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4573) Design doc for scheduler HTTP Stream IDs
[ https://issues.apache.org/jira/browse/MESOS-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4573: - Summary: Design doc for scheduler HTTP Stream IDs (was: Design doc for scheduler HTTP sessions) > Design doc for scheduler HTTP Stream IDs > > > Key: MESOS-4573 > URL: https://issues.apache.org/jira/browse/MESOS-4573 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Greg Mann >Assignee: Greg Mann > Labels: http, mesosphere > > This ticket is for the design of an HTTP session protocol for use with HTTP > schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4573) Design doc for scheduler HTTP Stream IDs
[ https://issues.apache.org/jira/browse/MESOS-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163863#comment-15163863 ] Greg Mann commented on MESOS-4573: -- The design document can be found here: https://docs.google.com/document/d/141wvs8upivIRw7I-tW5pW9ABP2gXKMmCB8hsV36ELc0/edit?usp=sharing > Design doc for scheduler HTTP Stream IDs > > > Key: MESOS-4573 > URL: https://issues.apache.org/jira/browse/MESOS-4573 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Greg Mann >Assignee: Greg Mann > Labels: http, mesosphere > > This ticket is for the design of an HTTP session protocol for use with HTTP > schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
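In outline, the design has the master assign a stream ID on the SUBSCRIBE response and requires the scheduler to echo it on every subsequent call, which is how the master tells different instances of the same framework's scheduler apart. A minimal header-building sketch, assuming the {{Mesos-Stream-Id}} header name from the design:

```python
def scheduler_headers(stream_id=None):
    """Build headers for a v1 scheduler API POST. The initial SUBSCRIBE
    carries no stream ID (the master assigns one and returns it in the
    Mesos-Stream-Id response header); every later call on behalf of that
    subscription must echo the same value back."""
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    if stream_id is not None:
        headers["Mesos-Stream-Id"] = stream_id
    return headers
```

A scheduler would call this without an ID for SUBSCRIBE, read the ID from the response headers, and pass it in for all later ACCEPT/DECLINE/ACKNOWLEDGE calls.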
[jira] [Updated] (MESOS-4676) ROOT_DOCKER_Logs is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-4676: - Sprint: Mesosphere Sprint 29 Story Points: 2 > ROOT_DOCKER_Logs is flaky. > -- > > Key: MESOS-4676 > URL: https://issues.apache.org/jira/browse/MESOS-4676 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27 > Environment: CentOS 7 with SSL. >Reporter: Bernd Mathiske >Assignee: Joseph Wu > Labels: flaky, mesosphere, test > > {noformat} > [18:06:25][Step 8/8] [ RUN ] DockerContainerizerTest.ROOT_DOCKER_Logs > [18:06:25][Step 8/8] I0215 17:06:25.256103 1740 leveldb.cpp:174] Opened db > in 6.548327ms > [18:06:25][Step 8/8] I0215 17:06:25.258002 1740 leveldb.cpp:181] Compacted > db in 1.837816ms > [18:06:25][Step 8/8] I0215 17:06:25.258059 1740 leveldb.cpp:196] Created db > iterator in 22044ns > [18:06:25][Step 8/8] I0215 17:06:25.258076 1740 leveldb.cpp:202] Seeked to > beginning of db in 2347ns > [18:06:25][Step 8/8] I0215 17:06:25.258091 1740 leveldb.cpp:271] Iterated > through 0 keys in the db in 571ns > [18:06:25][Step 8/8] I0215 17:06:25.258152 1740 replica.cpp:779] Replica > recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [18:06:25][Step 8/8] I0215 17:06:25.258936 1758 recover.cpp:447] Starting > replica recovery > [18:06:25][Step 8/8] I0215 17:06:25.259177 1758 recover.cpp:473] Replica is > in EMPTY status > [18:06:25][Step 8/8] I0215 17:06:25.260327 1757 replica.cpp:673] Replica in > EMPTY status received a broadcasted recover request from > (13608)@172.30.2.239:39785 > [18:06:25][Step 8/8] I0215 17:06:25.260545 1758 recover.cpp:193] Received a > recover response from a replica in EMPTY status > [18:06:25][Step 8/8] I0215 17:06:25.261065 1757 master.cpp:376] Master > 112363e2-c680-4946-8fee-d0626ed8b21e (ip-172-30-2-239.mesosphere.io) started > on 172.30.2.239:39785 > [18:06:25][Step 8/8] I0215 17:06:25.261209 1761 recover.cpp:564] Updating > replica status to STARTING > [18:06:25][Step 
8/8] I0215 17:06:25.261086 1757 master.cpp:378] Flags at > startup: --acls="" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate="true" > --authenticate_http="true" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/HncLLj/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/HncLLj/master" > --zk_session_timeout="10secs" > [18:06:25][Step 8/8] I0215 17:06:25.261446 1757 master.cpp:423] Master only > allowing authenticated frameworks to register > [18:06:25][Step 8/8] I0215 17:06:25.261456 1757 master.cpp:428] Master only > allowing authenticated slaves to register > [18:06:25][Step 8/8] I0215 17:06:25.261462 1757 credentials.hpp:35] Loading > credentials for authentication from '/tmp/HncLLj/credentials' > [18:06:25][Step 8/8] I0215 17:06:25.261723 1757 master.cpp:468] Using > default 'crammd5' authenticator > [18:06:25][Step 8/8] I0215 17:06:25.261855 1757 master.cpp:537] Using > default 'basic' HTTP authenticator > [18:06:25][Step 8/8] I0215 17:06:25.262022 1757 master.cpp:571] > Authorization enabled > [18:06:25][Step 8/8] I0215 17:06:25.262177 1755 hierarchical.cpp:144] > Initialized hierarchical allocator process > [18:06:25][Step 8/8] I0215 17:06:25.262177 1758 whitelist_watcher.cpp:77] No > whitelist given > [18:06:25][Step 8/8] I0215 
17:06:25.262899 1760 leveldb.cpp:304] Persisting > metadata (8 bytes) to leveldb took 1.517992ms > [18:06:25][Step 8/8] I0215 17:06:25.262924 1760 replica.cpp:320] Persisted > replica status to STARTING > [18:06:25][Step 8/8] I0215 17:06:25.263144 1754 recover.cpp:473] Replica is > in STARTING status > [18:06:25][Step 8/8] I0215 17:06:25.264010 1757 master.cpp:1712] The newly > elected leader is master@172.30.2.239:39785 with id > 112363e2-c680-4946-8fee-d0626ed8b21e > [18:06:25][Step 8/8] I0215 17:06:25.264044 1757 master.cpp:1725] Elected as > the leading master! > [18:06:25][Step 8/8] I0215 17:06:25.264061 1757 master.cpp:1470] Recovering > from registrar > [18:06:25][Step 8/8] I0215 17:06:25.264117 1760 replica.cpp:673] Replica in > STARTING
[jira] [Assigned] (MESOS-4676) ROOT_DOCKER_Logs is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu reassigned MESOS-4676: Assignee: Joseph Wu > ROOT_DOCKER_Logs is flaky. > -- > > Key: MESOS-4676 > URL: https://issues.apache.org/jira/browse/MESOS-4676 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.27 > Environment: CentOS 7 with SSL. > Reporter: Bernd Mathiske > Assignee: Joseph Wu > Labels: flaky, mesosphere, test
[jira] [Commented] (MESOS-4676) ROOT_DOCKER_Logs is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163720#comment-15163720 ] Joseph Wu commented on MESOS-4676: -- Based on the linked issue ("Bug report for Docker 1.9.1 on Fedora"), it looks like docker has some sort of race when the containerized process writes to both stdout & stderr at the same time. To mitigate the test hitting this: * Try separating the two {{echo}} commands. * Try using the {{unbuffer}} utility. i.e. {{unbuffer echo foo; unbuffer echo bar 1>&2}}. See https://github.com/docker/docker/issues/1385 > ROOT_DOCKER_Logs is flaky. > -- > > Key: MESOS-4676 > URL: https://issues.apache.org/jira/browse/MESOS-4676 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.27 > Environment: CentOS 7 with SSL. > Reporter: Bernd Mathiske > Labels: flaky, mesosphere, test
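The race and the suggested {{unbuffer}} mitigation in the comment above come down to stdio buffering. A minimal C++ sketch of what running a program "unbuffered" amounts to (illustrative only, not Mesos or test code; the function name is made up):

```cpp
#include <cstdio>

// Illustrative sketch: with stdio buffering disabled, each write to stdout
// and stderr is handed off immediately and in program order, instead of
// sitting in a buffer and racing with the other stream at flush time. This
// mirrors what wrapping a command in `unbuffer` achieves.
// Returns true if both writes succeeded.
bool writeUnbuffered(const char* out, const char* err)
{
  std::setvbuf(stdout, nullptr, _IONBF, 0);  // Disable stdout buffering.
  std::setvbuf(stderr, nullptr, _IONBF, 0);  // stderr is usually unbuffered already.
  return std::fputs(out, stdout) >= 0 && std::fputs(err, stderr) >= 0;
}
```

Separating the two {{echo}} commands attacks the same problem from the other side: if the writes are far enough apart in time, the two streams have no chance to interleave.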
[jira] [Created] (MESOS-4767) Apply batching to allocation events to reduce allocator backlogging.
Benjamin Mahler created MESOS-4767: -- Summary: Apply batching to allocation events to reduce allocator backlogging. Key: MESOS-4767 URL: https://issues.apache.org/jira/browse/MESOS-4767 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Benjamin Mahler Per the [discussion|https://issues.apache.org/jira/browse/MESOS-3157?focusedCommentId=14728377=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14728377] that came out of MESOS-3157, we'd like to batch together outstanding allocation dispatches in order to avoid backing up the allocator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4694) DRFAllocator takes very long to allocate resources with a large number of frameworks
[ https://issues.apache.org/jira/browse/MESOS-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4694: --- Issue Type: Improvement (was: Bug) > DRFAllocator takes very long to allocate resources with a large number of > frameworks > > > Key: MESOS-4694 > URL: https://issues.apache.org/jira/browse/MESOS-4694 > Project: Mesos > Issue Type: Improvement > Components: allocation >Affects Versions: 0.26.0, 0.27.0, 0.27.1 >Reporter: Dario Rexin >Assignee: Dario Rexin > > With a growing number of connected frameworks, the allocation time grows to > very high numbers. The addition of quota in 0.27 had an additional impact on > these numbers. Running `mesos-tests.sh --benchmark > --gtest_filter=HierarchicalAllocator_BENCHMARK_Test.DeclineOffers` gives us > the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 2.921202secs to make 200 offers > round 1 allocate took 2.85045secs to make 200 offers > round 2 allocate took 2.823768secs to make 200 offers > {noformat} > Increasing the number of frameworks to 2000: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 28.209454secs to make 2000 offers > round 1 allocate took 28.469419secs to make 2000 offers > round 2 allocate took 28.138086secs to make 2000 offers > {noformat} > I was able to reduce this time by a substantial amount. After applying the > patches: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 200 frameworks > round 0 allocate took 1.016226secs to make 2000 offers > round 1 allocate took 1.102729secs to make 2000 offers > round 2 allocate took 1.102624secs to make 2000 offers > {noformat} > And with 2000 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 12.563203secs to make 2000 offers > round 1 allocate took 12.437517secs to make 2000 offers > round 2 allocate took 12.470708secs to make 2000 offers > {noformat} > The patches do 3 things to improve the performance of the allocator. > 1) The total values in the DRFSorter will be pre-calculated per resource type > 2) In the allocate method, when no resources are available to allocate, we > break out of the innermost loop to prevent looping over a large number of > frameworks when we have nothing to allocate > 3) When a framework suppresses offers, we remove it from the sorter instead > of just calling continue in the allocation loop - this greatly improves > performance in the sorter and prevents looping over frameworks that don't > need resources > Assuming that most of the frameworks behave nicely and suppress offers when > they have nothing to schedule, it is fair to assume that point 3) has the > biggest impact on the performance. If we suppress offers for 90% of the > frameworks in the benchmark test, we see the following numbers: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. 
> [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 200 slaves and 2000 frameworks > round 0 allocate took 11626us to make 200 offers > round 1 allocate took 22890us to make 200 offers > round 2 allocate took 21346us to make 200 offers > {noformat} > And for 200 frameworks: > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from HierarchicalAllocator_BENCHMARK_Test > [ RUN ] HierarchicalAllocator_BENCHMARK_Test.DeclineOffers > Using 2000 slaves and 2000 frameworks > round 0 allocate took 1.11178secs to make 2000 offers > round 1 allocate took 1.062649secs to make 2000 offers > round 2 allocate took 1.080181secs to make 2000 offers > {noformat} > Review requests: > https://reviews.apache.org/r/43665/ > https://reviews.apache.org/r/43666/ >
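The three changes above can be sketched roughly as follows. This is a hypothetical, heavily simplified model of the allocation loop (integer "resources", frameworks as a plain list in sorter order), not the real HierarchicalDRF allocator; it illustrates points 2) and 3): an exhausted agent ends its inner loop early, and suppressed frameworks never enter the loop at all.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative stand-in for the real Resources type.
using Resources = int;

// Sketch: hand out each agent's remaining resources to frameworks in DRF
// share order, one unit at a time.
std::map<std::string, Resources> allocate(
    std::map<std::string, Resources> agents,
    const std::vector<std::string>& sorter,      // Frameworks in share order.
    const std::set<std::string>& suppressed)
{
  std::map<std::string, Resources> offers;

  // (3) Suppressed frameworks are removed up front instead of being
  // skipped with `continue` on every iteration of every agent's loop.
  std::vector<std::string> active;
  for (const std::string& framework : sorter) {
    if (suppressed.count(framework) == 0) {
      active.push_back(framework);
    }
  }

  for (std::map<std::string, Resources>::iterator it = agents.begin();
       it != agents.end();
       ++it) {
    Resources& available = it->second;
    for (const std::string& framework : active) {
      if (available == 0) {
        break;  // (2) Nothing left on this agent; skip remaining frameworks.
      }
      offers[framework] += 1;
      available -= 1;
    }
  }
  return offers;
}
```

Point 1), pre-calculating the DRFSorter totals per resource type, is a caching change inside the sorter and is not shown here.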
[jira] [Created] (MESOS-4766) Improve allocator performance.
Benjamin Mahler created MESOS-4766: -- Summary: Improve allocator performance. Key: MESOS-4766 URL: https://issues.apache.org/jira/browse/MESOS-4766 Project: Mesos Issue Type: Epic Components: allocation Reporter: Benjamin Mahler Priority: Critical This is an epic to track the various tickets around improving the performance of the allocator, including the following: * Preventing unnecessary backup of the allocator. * Reducing the cost of allocations and allocator state updates. * Improving performance of the DRF sorter. * More benchmarking to simulate scenarios with performance issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4738) Expose egress bandwidth as a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163563#comment-15163563 ] Jie Yu commented on MESOS-4738: --- Do we have a shepherd for this work? [~idownes], are you going to shepherd this work? > Expose egress bandwidth as a resource > - > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Assignee: Cong Wang >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource, so that we can drop packets above a specific limit, would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 Mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4738) Expose egress bandwidth as a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cong Wang reassigned MESOS-4738: Assignee: Cong Wang > Expose egress bandwidth as a resource > - > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Assignee: Cong Wang >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource, so that we can drop packets above a specific limit, would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 Mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4765) Add equality operator for `process::http::URL` objects.
Anand Mazumdar created MESOS-4765: - Summary: Add equality operator for `process::http::URL` objects. Key: MESOS-4765 URL: https://issues.apache.org/jira/browse/MESOS-4765 Project: Mesos Issue Type: Task Components: HTTP API, libprocess Reporter: Anand Mazumdar Priority: Minor Currently two {{process::http::URL}} objects cannot be compared. It would be good to add an equality operator for comparing them. This might require a hostname lookup provided that the {{URL}} object was constructed from {{domain}} and not from {{net::IP}}. The other details can be similar to the equality operator semantics of the corresponding Java 7 URL object: https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#equals(java.lang.Object) It would also allow us to get rid of the corresponding {{URL}} object comparison in {{type_utils.cpp}}, which just checks whether the serialized strings match. {code} // TODO(bmahler): Leverage process::http::URL for equality. bool operator==(const URL& left, const URL& right) { return left.SerializeAsString() == right.SerializeAsString(); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
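A field-wise operator along the lines the ticket asks for might look like the sketch below, written against a hypothetical simplified {{URL}} struct rather than the real {{process::http::URL}} (which also carries an optional {{ip}}; deciding whether a {{domain}}-constructed URL equals an {{ip}}-constructed one is what may force the hostname lookup mentioned above):

```cpp
#include <string>

// Hypothetical simplified stand-in for process::http::URL; the real type
// also carries an optional ip alongside the domain, which is what makes
// equality subtle.
struct URL
{
  std::string scheme;
  std::string domain;  // May require resolution to compare against an ip.
  int port;
  std::string path;
};

// Field-wise equality sketch. As with Java 7's URL.equals, a complete
// implementation would have to decide how domain-vs-ip comparison behaves,
// possibly resolving the domain first.
bool operator==(const URL& left, const URL& right)
{
  return left.scheme == right.scheme &&
         left.domain == right.domain &&
         left.port == right.port &&
         left.path == right.path;
}
```

Unlike the serialized-string comparison in {{type_utils.cpp}}, a field-wise operator can tolerate differences that don't affect identity (e.g. unset optional fields).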
[jira] [Updated] (MESOS-4641) Support Container Network Interface (CNI).
[ https://issues.apache.org/jira/browse/MESOS-4641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4641: -- Epic Name: CNI Support (was: cni) > Support Container Network Interface (CNI). > -- > > Key: MESOS-4641 > URL: https://issues.apache.org/jira/browse/MESOS-4641 > Project: Mesos > Issue Type: Epic >Reporter: Jie Yu >Assignee: Qian Zhang > > CoreOS developed the Container Network Interface (CNI), a proposed standard > for configuring network interfaces for Linux containers. Many CNI plugins > (e.g., calico) have already been developed. > https://coreos.com/blog/rkt-cni-networking.html > https://github.com/appc/cni/blob/master/SPEC.md > Kubernetes supports CNI as well. > http://blog.kubernetes.io/2016/01/why-Kubernetes-doesnt-use-libnetwork.html > In the context of Unified Containerizer, it would be nice if we can have a > 'network/cni' isolator which will speak the CNI protocol and prepare the > network for the container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4764) The network/cni isolator should report assigned IP address.
Jie Yu created MESOS-4764: - Summary: The network/cni isolator should report assigned IP address. Key: MESOS-4764 URL: https://issues.apache.org/jira/browse/MESOS-4764 Project: Mesos Issue Type: Task Reporter: Jie Yu In order for service discovery to work in some cases, the network/cni isolator needs to report the assigned IP address through the isolator->status() interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4763) Add test mock for CNI plugins.
[ https://issues.apache.org/jira/browse/MESOS-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4763: -- Labels: mesosphere (was: ) > Add test mock for CNI plugins. > -- > > Key: MESOS-4763 > URL: https://issues.apache.org/jira/browse/MESOS-4763 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Avinash Sridharan > Labels: mesosphere > > In order to test the network/cni isolator, we need to mock the behavior of a > CNI plugin. One option is to write a mock script which acts as a CNI plugin. > The isolator will talk to the mock script the same way it talks to an actual > CNI plugin. > The mock script can just join the host network? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4677) LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu reassigned MESOS-4677: Assignee: Joseph Wu > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids is flaky. > --- > > Key: MESOS-4677 > URL: https://issues.apache.org/jira/browse/MESOS-4677 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.27 >Reporter: Bernd Mathiske >Assignee: Joseph Wu > Labels: flaky, test > > This test fails very often when run on CentOS 7, but may also fail elsewhere > sometimes. Unfortunately, it tends to only fail when --verbose is not set. > The output is this: > {noformat} > [21:45:21][Step 8/8] [ RUN ] > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids > [21:45:21][Step 8/8] ../../src/tests/containerizer/isolator_tests.cpp:807: > Failure > [21:45:21][Step 8/8] Value of: usage.get().threads() > [21:45:21][Step 8/8] Actual: 0 > [21:45:21][Step 8/8] Expected: 1U > [21:45:21][Step 8/8] Which is: 1 > [21:45:21][Step 8/8] [ FAILED ] > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids (94 ms) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4763) Add test mock for CNI plugins.
Jie Yu created MESOS-4763: - Summary: Add test mock for CNI plugins. Key: MESOS-4763 URL: https://issues.apache.org/jira/browse/MESOS-4763 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Avinash Sridharan In order to test the network/cni isolator, we need to mock the behavior of a CNI plugin. One option is to write a mock script which acts as a CNI plugin. The isolator will talk to the mock script the same way it talks to an actual CNI plugin. The mock script can just join the host network? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
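Sketching what such a mock would answer: per the CNI spec, a plugin is driven by the {{CNI_COMMAND}} environment variable and reports its result as JSON on stdout. The function name and the reported address below are invented for illustration; a real mock would be a small executable script exercised by the isolator.

```cpp
#include <string>

// Illustrative mock of a CNI plugin's ADD/DEL handling (not Mesos code).
// A real plugin reads CNI_COMMAND from the environment and the network
// config from stdin; this sketch only fakes the resulting output.
std::string mockPluginOutput(const std::string& command)
{
  if (command == "ADD") {
    // Pretend the container simply joined the host network, as the ticket
    // suggests, and report a loopback-style address with empty DNS info.
    return R"({"ip4": {"ip": "127.0.0.1/8"}, "dns": {}})";
  }

  // DEL (and any other command in this mock) succeeds with no output.
  return "";
}
```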
[jira] [Updated] (MESOS-4742) Design doc for CNI isolator
[ https://issues.apache.org/jira/browse/MESOS-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4742: -- Labels: mesosphere (was: ) > Design doc for CNI isolator > --- > > Key: MESOS-4742 > URL: https://issues.apache.org/jira/browse/MESOS-4742 > Project: Mesos > Issue Type: Documentation > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > Labels: mesosphere > > This ticket is for the design of isolator for Container Network Interface > (CNI). > Design doc: > https://docs.google.com/document/d/1FFZwPHPZqS17cRQvsbbWyQbZpwIoHFR_N6AAApRv514/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4762) Setup proper DNS resolver for containers in network/cni isolator.
Jie Yu created MESOS-4762: - Summary: Setup proper DNS resolver for containers in network/cni isolator. Key: MESOS-4762 URL: https://issues.apache.org/jira/browse/MESOS-4762 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Avinash Sridharan Please get more context from the design doc (MESOS-4742). The CNI plugin will return the DNS information about the network. The network/cni isolator needs to properly set up /etc/resolv.conf for the container. We should consider the following cases: 1) container is using host filesystem 2) container is using a different filesystem 3) custom executor and command executor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4762) Setup proper DNS resolver for containers in network/cni isolator.
[ https://issues.apache.org/jira/browse/MESOS-4762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4762: -- Labels: mesosphere (was: ) > Setup proper DNS resolver for containers in network/cni isolator. > - > > Key: MESOS-4762 > URL: https://issues.apache.org/jira/browse/MESOS-4762 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Avinash Sridharan > Labels: mesosphere > > Please get more context from the design doc (MESOS-4742). > The CNI plugin will return the DNS information about the network. The > network/cni isolator needs to properly set up /etc/resolv.conf for the > container. We should consider the following cases: > 1) container is using host filesystem > 2) container is using a different filesystem > 3) custom executor and command executor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
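Whichever of the three cases applies, the end state is the same: the container sees a resolv.conf assembled from the DNS information in the plugin's result. A hypothetical rendering (nameserver address and search domain invented for illustration):

```
nameserver 10.1.0.2
search example.local
```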
[jira] [Created] (MESOS-4761) Add agent flags to allow operators to specify CNI plugin and config directories.
Jie Yu created MESOS-4761: - Summary: Add agent flags to allow operators to specify CNI plugin and config directories. Key: MESOS-4761 URL: https://issues.apache.org/jira/browse/MESOS-4761 Project: Mesos Issue Type: Task Reporter: Jie Yu Assignee: Qian Zhang According to the design doc, we plan to add the following flags: “--network_cni_plugins_dir” Location of the CNI plugin binaries. The “network/cni” isolator will find CNI plugins under this directory so that it can execute the plugins to add/delete a container from the CNI networks. It is the operator’s responsibility to install the CNI plugin binaries in the specified directory. “--network_cni_config_dir” Location of the CNI network configuration files. For each network that containers launched on the Mesos agent can connect to, the operator should install a network configuration file in JSON format in the specified directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
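For illustration, a file under {{--network_cni_config_dir}} would follow the CNI network configuration format; a minimal hypothetical example (network name, plugin type, and subnet are all invented):

```json
{
  "name": "mesos-net",
  "type": "bridge",
  "ipam": {
    "type": "host-local",
    "subnet": "10.1.0.0/16"
  }
}
```

The corresponding plugin binary (here, "bridge") would then have to be installed under {{--network_cni_plugins_dir}}.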
[jira] [Updated] (MESOS-4742) Design doc for CNI isolator
[ https://issues.apache.org/jira/browse/MESOS-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-4742: - Issue Type: Documentation (was: Bug) > Design doc for CNI isolator > --- > > Key: MESOS-4742 > URL: https://issues.apache.org/jira/browse/MESOS-4742 > Project: Mesos > Issue Type: Documentation > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > This ticket is for the design of isolator for Container Network Interface > (CNI). > Design doc: > https://docs.google.com/document/d/1FFZwPHPZqS17cRQvsbbWyQbZpwIoHFR_N6AAApRv514/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4758) Add a 'name' field into NetworkInfo.
[ https://issues.apache.org/jira/browse/MESOS-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4758: -- Issue Type: Task (was: Bug) > Add a 'name' field into NetworkInfo. > > > Key: MESOS-4758 > URL: https://issues.apache.org/jira/browse/MESOS-4758 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Qian Zhang > > This allows the framework writer to specify the name of the network they want > their container to join. > Why not use 'groups'? That's because there might be multiple groups under a > single network (e.g., admin vs. user, public vs. private, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4759) Add network/cni isolator for Mesos containerizer.
[ https://issues.apache.org/jira/browse/MESOS-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4759: -- Issue Type: Task (was: Bug) > Add network/cni isolator for Mesos containerizer. > - > > Key: MESOS-4759 > URL: https://issues.apache.org/jira/browse/MESOS-4759 > Project: Mesos > Issue Type: Task >Reporter: Jie Yu >Assignee: Qian Zhang > > See the design doc for more context (MESOS-4742). > The isolator will interact with CNI plugins to create the network for the > container to join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4759) Add network/cni isolator for Mesos containerizer.
Jie Yu created MESOS-4759: - Summary: Add network/cni isolator for Mesos containerizer. Key: MESOS-4759 URL: https://issues.apache.org/jira/browse/MESOS-4759 Project: Mesos Issue Type: Bug Reporter: Jie Yu See the design doc for more context (MESOS-4742). The isolator will interact with CNI plugins to create the network for the container to join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4759) Add network/cni isolator for Mesos containerizer.
[ https://issues.apache.org/jira/browse/MESOS-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4759: -- Shepherd: Jie Yu > Add network/cni isolator for Mesos containerizer. > - > > Key: MESOS-4759 > URL: https://issues.apache.org/jira/browse/MESOS-4759 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Qian Zhang > > See the design doc for more context (MESOS-4742). > The isolator will interact with CNI plugins to create the network for the > container to join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4759) Add network/cni isolator for Mesos containerizer.
[ https://issues.apache.org/jira/browse/MESOS-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4759: -- Assignee: Qian Zhang > Add network/cni isolator for Mesos containerizer. > - > > Key: MESOS-4759 > URL: https://issues.apache.org/jira/browse/MESOS-4759 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Qian Zhang > > See the design doc for more context (MESOS-4742). > The isolator will interact with CNI plugins to create the network for the > container to join. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4760) Expose metrics and gauges for fetcher cache usage and hit rate
Michael Browning created MESOS-4760: --- Summary: Expose metrics and gauges for fetcher cache usage and hit rate Key: MESOS-4760 URL: https://issues.apache.org/jira/browse/MESOS-4760 Project: Mesos Issue Type: Improvement Components: fetcher, statistics Reporter: Michael Browning Priority: Minor To evaluate the fetcher cache and calibrate the value of the fetcher_cache_size flag, it would be useful to have metrics and gauges on agents that expose operational statistics like cache hit rate, occupied cache size, and time spent downloading resources that were not present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4758) Add a 'name' field into NetworkInfo.
[ https://issues.apache.org/jira/browse/MESOS-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4758: -- Story Points: 1 > Add a 'name' field into NetworkInfo. > > > Key: MESOS-4758 > URL: https://issues.apache.org/jira/browse/MESOS-4758 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Qian Zhang > > This allows the framework writer to specify the name of the network they want > their container to join. > Why not using 'groups'? That's because there might be multiple groups under a > single network (e.g., admin vs. user, public vs. private, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4758) Add a 'name' field into NetworkInfo.
Jie Yu created MESOS-4758: - Summary: Add a 'name' field into NetworkInfo. Key: MESOS-4758 URL: https://issues.apache.org/jira/browse/MESOS-4758 Project: Mesos Issue Type: Bug Reporter: Jie Yu Assignee: Qian Zhang This allows the framework writer to specify the name of the network they want their container to join. Why not using 'groups'? That's because there might be multiple groups under a single network (e.g., admin vs. user, public vs. private, etc.). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4742) Design doc for CNI isolator
[ https://issues.apache.org/jira/browse/MESOS-4742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4742: -- Description: This ticket is for the design of isolator for Container Network Interface (CNI). Design doc: https://docs.google.com/document/d/1FFZwPHPZqS17cRQvsbbWyQbZpwIoHFR_N6AAApRv514/edit?usp=sharing was:This ticket is for the design of isolator for Container Network Interface (CNI). > Design doc for CNI isolator > --- > > Key: MESOS-4742 > URL: https://issues.apache.org/jira/browse/MESOS-4742 > Project: Mesos > Issue Type: Bug > Components: isolation >Reporter: Qian Zhang >Assignee: Qian Zhang > > This ticket is for the design of isolator for Container Network Interface > (CNI). > Design doc: > https://docs.google.com/document/d/1FFZwPHPZqS17cRQvsbbWyQbZpwIoHFR_N6AAApRv514/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4757) Mesos containerizer should get uid/gids before pivot_root.
[ https://issues.apache.org/jira/browse/MESOS-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-4757: - Assignee: Jie Yu > Mesos containerizer should get uid/gids before pivot_root. > -- > > Key: MESOS-4757 > URL: https://issues.apache.org/jira/browse/MESOS-4757 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Jie Yu > > Currently, we call os::su(user) after pivot_root. This is problematic because > /etc/passwd and /etc/group might be missing in container's root filesystem. > We should instead, get the uid/gids before pivot_root, and call > setuid/setgroups after pivot_root. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4757) Mesos containerizer should get uid/gids before pivot_root.
Jie Yu created MESOS-4757: - Summary: Mesos containerizer should get uid/gids before pivot_root. Key: MESOS-4757 URL: https://issues.apache.org/jira/browse/MESOS-4757 Project: Mesos Issue Type: Bug Reporter: Jie Yu Currently, we call os::su(user) after pivot_root. This is problematic because /etc/passwd and /etc/group might be missing in the container's root filesystem. We should instead get the uid/gids before pivot_root, and call setuid/setgroups after pivot_root. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4738) Expose egress bandwidth as a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163473#comment-15163473 ] Sargun Dhillon commented on MESOS-4738: --- Yeah, we're using DRR for egress load balancing at the moment with our security stuff. > Expose egress bandwidth as a resource > - > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource, so we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4756) DockerContainerizerTest.ROOT_DOCKER_DockerInspectDiscard is flaky on CentOS 6
Joseph Wu created MESOS-4756: Summary: DockerContainerizerTest.ROOT_DOCKER_DockerInspectDiscard is flaky on CentOS 6 Key: MESOS-4756 URL: https://issues.apache.org/jira/browse/MESOS-4756 Project: Mesos Issue Type: Bug Components: tests Affects Versions: 0.27 Environment: Centos6 (AWS) + GCC 4.9 Reporter: Joseph Wu {code} [ RUN ] DockerContainerizerTest.ROOT_DOCKER_DockerInspectDiscard I0224 17:50:26.577450 17755 leveldb.cpp:174] Opened db in 6.715352ms I0224 17:50:26.579607 17755 leveldb.cpp:181] Compacted db in 2.128954ms I0224 17:50:26.579648 17755 leveldb.cpp:196] Created db iterator in 16927ns I0224 17:50:26.579661 17755 leveldb.cpp:202] Seeked to beginning of db in 1408ns I0224 17:50:26.579669 17755 leveldb.cpp:271] Iterated through 0 keys in the db in 343ns I0224 17:50:26.579721 17755 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0224 17:50:26.580185 17776 recover.cpp:447] Starting replica recovery I0224 17:50:26.580382 17776 recover.cpp:473] Replica is in EMPTY status I0224 17:50:26.581264 17770 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (14098)@172.30.2.121:33050 I0224 17:50:26.581771 17772 recover.cpp:193] Received a recover response from a replica in EMPTY status I0224 17:50:26.582188 17771 recover.cpp:564] Updating replica status to STARTING I0224 17:50:26.583030 17772 master.cpp:376] Master 00a3ac12-9e76-48f5-92fa-48770b82035d (ip-172-30-2-121.mesosphere.io) started on 172.30.2.121:33050 I0224 17:50:26.583051 17772 master.cpp:378] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/jSZ9of/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" 
--logging_level="INFO" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/jSZ9of/master" --zk_session_timeout="10secs" I0224 17:50:26.583328 17772 master.cpp:423] Master only allowing authenticated frameworks to register I0224 17:50:26.583336 17772 master.cpp:428] Master only allowing authenticated slaves to register I0224 17:50:26.583343 17772 credentials.hpp:35] Loading credentials for authentication from '/tmp/jSZ9of/credentials' I0224 17:50:26.583901 17772 master.cpp:468] Using default 'crammd5' authenticator I0224 17:50:26.584022 17772 master.cpp:537] Using default 'basic' HTTP authenticator I0224 17:50:26.584141 17772 master.cpp:571] Authorization enabled I0224 17:50:26.584234 17770 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.955608ms I0224 17:50:26.584264 17770 replica.cpp:320] Persisted replica status to STARTING I0224 17:50:26.584285 17771 hierarchical.cpp:144] Initialized hierarchical allocator process I0224 17:50:26.584295 17773 whitelist_watcher.cpp:77] No whitelist given I0224 17:50:26.584463 17775 recover.cpp:473] Replica is in STARTING status I0224 17:50:26.585260 17771 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (14100)@172.30.2.121:33050 I0224 17:50:26.585553 1 recover.cpp:193] Received a recover response from a replica in STARTING status I0224 17:50:26.586042 17773 recover.cpp:564] Updating replica status to VOTING I0224 17:50:26.586091 17770 master.cpp:1712] The newly elected leader is master@172.30.2.121:33050 with id 00a3ac12-9e76-48f5-92fa-48770b82035d I0224 
17:50:26.586122 17770 master.cpp:1725] Elected as the leading master! I0224 17:50:26.586146 17770 master.cpp:1470] Recovering from registrar I0224 17:50:26.586294 17773 registrar.cpp:307] Recovering registrar I0224 17:50:26.588148 17776 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.89126ms I0224 17:50:26.588171 17776 replica.cpp:320] Persisted replica status to VOTING I0224 17:50:26.588260 17772 recover.cpp:578] Successfully joined the Paxos group I0224 17:50:26.588440 17772 recover.cpp:462] Recover process terminated I0224 17:50:26.588770 17773 log.cpp:659] Attempting to start the writer I0224 17:50:26.589782 17770 replica.cpp:493] Replica received implicit promise request from (14101)@172.30.2.121:33050 with proposal 1 I0224 17:50:26.591498 17770 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb
[jira] [Commented] (MESOS-4677) LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163417#comment-15163417 ] Joseph Wu commented on MESOS-4677: -- My guess is this: # The first {{usage = isolator.get()->usage(containerId);}} comes right after we isolate the test process, by writing to {{cgroup.procs}}. Underneath, the cgroups API probably blocks the write from completing until the cgroups are updated. # We do an {{os::close}} on a parent pipe to trigger the test process into {{exec}} ing. # We immediately call {{usage = isolator.get()->usage(containerId);}} again. # {{cgroups.procs}} doesn't change since {{exec}} doesn't change the PID. But there may be a race between updating the "threads" ({{cgroup/tasks}}) and us reading the {{cgroup/tasks}}. We can either: * Import the {{cgroups.h}} header and use {{cgroups_lock}}/{{cgroups_unlock}} to synchronize. * Add a sleep between closing the parent pipe and calling {{->usage(...)}}. * Do some sort of operation on the test process (which would confirm that it is finished {{exec}} ing). In this case we can write to the {{cat}} test process and read the echoed result. > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids is flaky. > --- > > Key: MESOS-4677 > URL: https://issues.apache.org/jira/browse/MESOS-4677 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.27 >Reporter: Bernd Mathiske > Labels: flaky, test > > This test fails very often when run on CentOS 7, but may also fail elsewhere > sometimes. Unfortunately, it tends to only fail when --verbose is not set. 
> The output is this: > {noformat} > [21:45:21][Step 8/8] [ RUN ] > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids > [21:45:21][Step 8/8] ../../src/tests/containerizer/isolator_tests.cpp:807: > Failure > [21:45:21][Step 8/8] Value of: usage.get().threads() > [21:45:21][Step 8/8] Actual: 0 > [21:45:21][Step 8/8] Expected: 1U > [21:45:21][Step 8/8] Which is: 1 > [21:45:21][Step 8/8] [ FAILED ] > LimitedCpuIsolatorTest.ROOT_CGROUPS_Pids_and_Tids (94 ms) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4602) Invalid usage of ATOMIC_FLAG_INIT in member initialization
[ https://issues.apache.org/jira/browse/MESOS-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163404#comment-15163404 ] Anand Mazumdar commented on MESOS-4602: --- {code} commit fd1101db8af8f3ea684a09e2f1d79f5fa9b69496 Author: Yong Tang yong.tang.git...@outlook.com Date: Tue Feb 23 10:47:15 2016 +0100 Fixed invalid usage of ATOMIC_FLAG_INIT in libprocess. Review: https://reviews.apache.org/r/43859/ {code} > Invalid usage of ATOMIC_FLAG_INIT in member initialization > -- > > Key: MESOS-4602 > URL: https://issues.apache.org/jira/browse/MESOS-4602 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Bannier >Assignee: Yong Tang > Labels: newbie, tech-debt > > MESOS-2925 fixed a few instances where {{ATOMIC_FLAG_INIT}} was used in > initializer lists, but missed to fix > {{3rdparty/libprocess/src/libevent_ssl_socket.cpp}} (even though the > corresponding header was touched). > There, {{LibeventSSLSocketImpl}}'s {{lock}} member is still (incorrectly) > initialized in initializer lists, even though the member is already > initialized in the class declaration, so it appears they should be dropped. > Clang from trunk incorrectly diagnoses the initializations in the initializer > lists as benign redundant braces in initialization of a scalar, but they > should be fixed for the reasons stated in MESOS-2925. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4602) Invalid usage of ATOMIC_FLAG_INIT in member initialization
[ https://issues.apache.org/jira/browse/MESOS-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163399#comment-15163399 ] Yong Tang commented on MESOS-4602: -- The patch has been applied. Thanks [~bbannier] and [~tillt] for reviews. > Invalid usage of ATOMIC_FLAG_INIT in member initialization > -- > > Key: MESOS-4602 > URL: https://issues.apache.org/jira/browse/MESOS-4602 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Bannier >Assignee: Yong Tang > Labels: newbie, tech-debt > > MESOS-2925 fixed a few instances where {{ATOMIC_FLAG_INIT}} was used in > initializer lists, but missed to fix > {{3rdparty/libprocess/src/libevent_ssl_socket.cpp}} (even though the > corresponding header was touched). > There, {{LibeventSSLSocketImpl}}'s {{lock}} member is still (incorrectly) > initialized in initializer lists, even though the member is already > initialized in the class declaration, so it appears they should be dropped. > Clang from trunk incorrectly diagnoses the initializations in the initializer > lists as benign redundant braces in initialization of a scalar, but they > should be fixed for the reasons stated in MESOS-2925. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4492) Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation
[ https://issues.apache.org/jira/browse/MESOS-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163382#comment-15163382 ] Greg Mann commented on MESOS-4492: -- Sure, I'm happy to help review! [~fan.du], if you could post a link to the review request here when you submit it, I'll have a look :-) > Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation > -- > > Key: MESOS-4492 > URL: https://issues.apache.org/jira/browse/MESOS-4492 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Fan Du >Assignee: Fan Du >Priority: Minor > > This ticket aims to enable user or operator to inspect operation statistics > such as RESERVE, UNRESERVE, CREATE and DESTROY, current implementation only > supports LAUNCH. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-4047: - Assignee: Alexander Rojas (was: Joseph Wu) Sprint: Mesosphere Sprint 23, Mesosphere Sprint 24, Mesosphere Sprint 29 (was: Mesosphere Sprint 23, Mesosphere Sprint 24) Fix Version/s: 0.28.0 > MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky > --- > > Key: MESOS-4047 > URL: https://issues.apache.org/jira/browse/MESOS-4047 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 > Environment: Ubuntu 14, gcc 4.8.4 >Reporter: Joseph Wu >Assignee: Alexander Rojas > Labels: flaky, flaky-test > Fix For: 0.27.0, 0.28.0 > > > {code:title=Output from passed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:14.319327 5062 exec.cpp:134] Version: 0.27.0 > I1202 11:09:14.17 5079 exec.cpp:208] Executor registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Registered executor on ubuntu > Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > Forked command at 5085 > I1202 11:09:14.391739 5077 exec.cpp:254] Received reconnect request from > slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > I1202 11:09:14.398598 5082 exec.cpp:231] Executor re-registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Re-registered executor on ubuntu > Shutting down > Sending SIGTERM to process tree at pid 5085 > Killing the following process trees: > [ > -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done > \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp > ] > [ OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms) > {code} > {code:title=Output from failed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 
records out > 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:15.509950 5109 exec.cpp:134] Version: 0.27.0 > I1202 11:09:15.568183 5123 exec.cpp:208] Executor registered on slave > 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > Registered executor on ubuntu > Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6 > Forked command at 5132 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > I1202 11:09:15.665498 5129 exec.cpp:254] Received reconnect request from > slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > I1202 11:09:15.670995 5123 exec.cpp:381] Executor asked to shutdown > Shutting down > Sending SIGTERM to process tree at pid 5132 > ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure > (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913 > *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are > using GNU date *** > {code} > Notice that in the failed test, the executor is asked to shutdown when it > tries to reconnect to the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4738) Expose egress bandwidth as a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163362#comment-15163362 ] Avinash Sridharan edited comment on MESOS-4738 at 2/24/16 5:30 PM: --- I agree with [~w013ccw] on this, we don't have a clean way of putting ingress bandwidth limits and hence should focus on egress. As far as treating egress bandwidth as a resource, we can treat this resource as a minimal rate guarantee rather than a fixed rate allocation. Using a DRR scheduler (https://en.wikipedia.org/wiki/Deficit_round_robin) on the egress rates should guarantee a minimum rate to each container . I think qdisc does support DRR (http://manpages.ubuntu.com/manpages/raring/man8/tc-drr.8.html) . was (Author: avin...@mesosphere.io): I agree with [~w013ccw] on this, we don't have a clean way of putting ingress bandwidth limits and hence should focus on egress. As far as treating egress bandwidth as a resource, we can treat this resource as a minimal rate guarantee rather than a fixed rate allocation. Using a DRR scheduler (https://en.wikipedia.org/wiki/Deficit_round_robin) on the egress rates should guarantee a minimum rate to each container . I think qdisc does support DRR. > Expose egress bandwidth as a resource > - > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource, so we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4738) Expose egress bandwidth as a resource
[ https://issues.apache.org/jira/browse/MESOS-4738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163362#comment-15163362 ] Avinash Sridharan commented on MESOS-4738: -- I agree with [~w013ccw] on this, we don't have a clean way of putting ingress bandwidth limits and hence should focus on egress. As far as treating egress bandwidth as a resource, we can treat this resource as a minimal rate guarantee rather than a fixed rate allocation. Using a DRR scheduler (https://en.wikipedia.org/wiki/Deficit_round_robin) on the egress rates should guarantee a minimum rate to each container . I think qdisc does support DRR. > Expose egress bandwidth as a resource > - > > Key: MESOS-4738 > URL: https://issues.apache.org/jira/browse/MESOS-4738 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon >Priority: Minor > Labels: mesosphere > > Some of our users care about variable network network isolation. Although we > cannot fundamentally limit ingress network bandwidth, having it as a > resource, so we can drop packets above a specific limit would be attractive. > It would be nice to expose egress and ingress bandwidth as an agent resource, > perhaps with a default of 10,000 mbps, and we can allow people to adjust as > needed. Alternatively, a more advanced design would involve generating > heuristics based on an analysis of the network MII / PHY. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
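A DRR setup of the kind described above might look roughly like the following tc fragment. This is an untested configuration sketch: the interface name (eth0), the fwmark values, and the quantum sizes are all assumptions; quanta only set the *relative* minimum share each class is guaranteed per round:

```shell
# Root DRR qdisc on the egress interface (eth0 is an assumption):
tc qdisc add dev eth0 root handle 1: drr

# One class per container; the quantum sets the relative guaranteed share.
tc class add dev eth0 parent 1: classid 1:1 drr quantum 60000   # container A
tc class add dev eth0 parent 1: classid 1:2 drr quantum 20000   # container B

# Classify egress packets by a per-container fwmark (set elsewhere,
# e.g. via iptables; marks 1 and 2 are illustrative):
tc filter add dev eth0 parent 1: protocol ip handle 1 fw flowid 1:1
tc filter add dev eth0 parent 1: protocol ip handle 2 fw flowid 1:2
```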
[jira] [Commented] (MESOS-4492) Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation
[ https://issues.apache.org/jira/browse/MESOS-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163353#comment-15163353 ] haosdent commented on MESOS-4492: - Keep in mind we could also RESERVE and UNRESERVE through HTTP endpoints. We need to track them as well. > Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation > -- > > Key: MESOS-4492 > URL: https://issues.apache.org/jira/browse/MESOS-4492 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Fan Du >Assignee: Fan Du >Priority: Minor > > This ticket aims to enable user or operator to inspect operation statistics > such as RESERVE, UNRESERVE, CREATE and DESTROY, current implementation only > supports LAUNCH. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4492) Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation
[ https://issues.apache.org/jira/browse/MESOS-4492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163335#comment-15163335 ] haosdent commented on MESOS-4492: - It seems you have not submitted the patch yet? > Add metrics for {RESERVE, UNRESERVE} and {CREATE, DESTROY} offer operation > -- > > Key: MESOS-4492 > URL: https://issues.apache.org/jira/browse/MESOS-4492 > Project: Mesos > Issue Type: Improvement > Components: master >Reporter: Fan Du >Assignee: Fan Du >Priority: Minor > > This ticket aims to enable user or operator to inspect operation statistics > such as RESERVE, UNRESERVE, CREATE and DESTROY, current implementation only > supports LAUNCH. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4755) Update roleSorter when slave active/deactive
Klaus Ma created MESOS-4755: --- Summary: Update roleSorter when slave active/deactive Key: MESOS-4755 URL: https://issues.apache.org/jira/browse/MESOS-4755 Project: Mesos Issue Type: Bug Components: allocation Reporter: Klaus Ma Assignee: Klaus Ma Currently, the total resources of {{roleSorter}} are not updated when an agent is activated/deactivated. We need to remove slave.total from the roleSorter when the agent is deactivated, and add it back when the agent is activated again. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4747) ContainerLoggerTest.MesosContainerizerRecover cannot be executed in isolation
[ https://issues.apache.org/jira/browse/MESOS-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4747: Sprint: Mesosphere Sprint 29 Labels: mesosphere (was: ) > ContainerLoggerTest.MesosContainerizerRecover cannot be executed in isolation > - > > Key: MESOS-4747 > URL: https://issues.apache.org/jira/browse/MESOS-4747 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Labels: mesosphere > Fix For: 0.28.0 > > > Some cleanup of spawned processes is missing in > {{ContainerLoggerTest.MesosContainerizerRecover}} so that when the test is > run in isolation the global teardown might find lingering processes. > {code} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from ContainerLoggerTest > [ RUN ] ContainerLoggerTest.MesosContainerizerRecover > [ OK ] ContainerLoggerTest.MesosContainerizerRecover (13 ms) > [--] 1 test from ContainerLoggerTest (13 ms total) > [--] Global test environment tear-down > ../../src/tests/environment.cpp:728: Failure > Failed > Tests completed with child processes remaining: > -+- 7112 /SOME/PATH/src/mesos/build/src/.libs/mesos-tests > --gtest_filter=ContainerLoggerTest.MesosContainerizerRecover > \--- 7130 (sh) > [==] 1 test from 1 test case ran. (23 ms total) > [ PASSED ] 1 test. > [ FAILED ] 0 tests, listed below: > 0 FAILED TESTS > {code} > Observered on OS X with clang-trunk and an unoptimized build. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4754) The "executors" field is exposed under a backwards incompatible schema.
[ https://issues.apache.org/jira/browse/MESOS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162684#comment-15162684 ] Michael Park edited comment on MESOS-4754 at 2/24/16 10:08 AM: --- The issue here is that even though {{src/common/http.cpp}} has a definition of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}}, its declaration is missing from {{src/common/http.hpp}}. We would have liked this to cause a compiler error, but it didn't because of the generic {{json}} function for protobuf messages: {{inline void json(ObjectWriter* writer, const google::protobuf::Message& message)}}, which can jsonify {{ExecutorInfo}} using the protobuf schema. The resolution will be the following: 1. Add the missing declaration of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}} to {{src/common/http.hpp}} 2. Make the generic {{json}} function that handles protobuf messages to require explicit opt-in. {code} -writer->field("cgroup_info", status.cgroup_info()); +writer->field("cgroup_info", JSON::Protobuf(status.cgroup_info())); {code} was (Author: mcypark): The issue here is that even though {{src/common/http.cpp}} has a definition of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}}, its declaration is missing from {{src/common/http.hpp}}. We would have liked this to cause a compiler error, but it didn't because of the generic {{json}} function for protobuf messages: {{inline void json(ObjectWriter* writer, const google::protobuf::Message& message)}}, which can jsonify {{ExecutorInfo}} using the protobuf schema. The resolution will be the following: 1. Add the missing declaration of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}} to {{src/common/http.hpp}} 2. Make the generic {{json}} function that handles protobuf messages to required explicit opt-in. 
{code} -writer->field("cgroup_info", status.cgroup_info()); +writer->field("cgroup_info", JSON::Protobuf(status.cgroup_info())); {code} > The "executors" field is exposed under a backwards incompatible schema. > --- > > Key: MESOS-4754 > URL: https://issues.apache.org/jira/browse/MESOS-4754 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Michael Park >Assignee: Michael Park > Labels: mesosphere > Fix For: 0.27.2 > > > In 0.26.0, the master's {{/state}} endpoint generated the following: > {code} > { > /* ... */ > "frameworks": [ > { > /* ... */ > "executors": [ > { > "command": { > "argv": [], > "uris": [], > "value": > "/Users/mpark/Projects/mesos/build/opt/src/long-lived-executor" > }, > "executor_id": "default", > "framework_id": "0ea528a9-64ba-417f-98ea-9c4b8d418db6-", > "name": "Long Lived Executor (C++)", > "resources": { > "cpus": 0, > "disk": 0, > "mem": 0 > }, > "slave_id": "8a513678-03a1-4cb5-9279-c3c0c591f1d8-S0" > } > ], > /* ... */ > } > ] > /* ... */ > } > {code} > In 0.27.1, the {{ExecutorInfo}} is mistakenly exposed in the raw protobuf > schema: > {code} > { > /* ... */ > "frameworks": [ > { > /* ... */ > "executors": [ > { > "command": { > "shell": true, > "value": > "/Users/mpark/Projects/mesos/build/opt/src/long-lived-executor" > }, > "executor_id": { > "value": "default" > }, > "framework_id": { > "value": "368a5a49-480b-41f6-a13b-24a69c92a72e-" > }, > "name": "Long Lived Executor (C++)", > "slave_id": "8a513678-03a1-4cb5-9279-c3c0c591f1d8-S0", > "source": "cpp_long_lived_framework" > } > ], > /* ... */ > } > ] > /* ... */ > } > {code} > This is a backwards incompatible API change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4754) The "executors" field is exposed under a backwards incompatible schema.
[ https://issues.apache.org/jira/browse/MESOS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162684#comment-15162684 ] Michael Park edited comment on MESOS-4754 at 2/24/16 10:07 AM: --- The issue here is that even though {{src/common/http.cpp}} has a definition of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}}, its declaration is missing from {{src/common/http.hpp}}. We would have liked this to cause a compiler error, but it didn't because of the generic {{json}} function for protobuf messages: {{inline void json(ObjectWriter* writer, const google::protobuf::Message& message)}}, which can jsonify {{ExecutorInfo}} using the protobuf schema. The resolution will be the following: 1. Add the missing declaration of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}} to {{src/common/http.hpp}} 2. Make the generic {{json}} function that handles protobuf messages require explicit opt-in. {code} -writer->field("cgroup_info", status.cgroup_info()); +writer->field("cgroup_info", JSON::Protobuf(status.cgroup_info())); {code} was (Author: mcypark): The issue here is that even though {{src/common/http.cpp}} has a definition of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}}, its declaration is missing from {{src/common/http.hpp}}. This should have caused a compiler error, but it did not because we provide a generic {{json}} function for protobuf messages: {{inline void json(ObjectWriter* writer, const google::protobuf::Message& message)}}, which can jsonify {{ExecutorInfo}} using the protobuf schema. The resolution will be the following: 1. Add the missing declaration of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}} to {{src/common/http.hpp}} 2. Make the generic {{json}} function that handles protobuf messages require explicit opt-in. 
{code} -writer->field("cgroup_info", status.cgroup_info()); +writer->field("cgroup_info", JSON::Protobuf(status.cgroup_info())); {code}
[jira] [Commented] (MESOS-4743) Mesos fetcher not working correctly on docker apps on CoreOS
[ https://issues.apache.org/jira/browse/MESOS-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162717#comment-15162717 ] Guillermo Rodriguez commented on MESOS-4743: Yes, it is running inside a container. I installed 0.27.1 but it's still the same. I hope MESOS-4249 fixes it. Thanks!!! > Mesos fetcher not working correctly on docker apps on CoreOS > > > Key: MESOS-4743 > URL: https://issues.apache.org/jira/browse/MESOS-4743 > Project: Mesos > Issue Type: Bug > Components: docker, fetcher >Affects Versions: 0.26.0 >Reporter: Guillermo Rodriguez > > I initially sent this issue to the Marathon group. They asked me to send it > here. This is the original thread: > https://github.com/mesosphere/marathon/issues/3179 > Then they closed it, so I had to ask again with more proof. > https://github.com/mesosphere/marathon/issues/3213 > In a nutshell, when I start a Marathon task that uses the URI while running > on CoreOS, the file is effectively fetched but not passed to the container. I > can see the file in the Mesos UI, but the file is not in the container. It is, > however, downloaded to another folder. > It is very simple to test. The original ticket has two files attached with a > Marathon JSON for a Prometheus server and a prometheus.yml config file. The > objective is to start Prometheus with the config file. > CoreOS 899.6 > Mesos 0.26 > Marathon 0.15.2 > Thanks!
[jira] [Updated] (MESOS-4753) Add executor state when reporting resource usage
[ https://issues.apache.org/jira/browse/MESOS-4753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Du updated MESOS-4753: -- Description: The slave reports the resource usage of each executor so the resource estimator can feed the master with revocable resources. It would be better to append the executor state as well when reporting usage, so that the resource estimator can easily focus on *RUNNING* executors only. It is possible to call {{Slave::getExecutor}} in the estimator, but that may not be in sync with the resource usage. was: The slave reports the resource usage of each executor so the resource estimator can feed the master with revocable resources. It would be better to append the executor state as well when reporting usage, so that the resource estimator can easily focus on *RUNNING* executors only. It is possible to call {code} Slave::getExecutor {code} in the estimator, but that may not be in sync with the resource usage. > Add executor state when reporting resource usage > > > Key: MESOS-4753 > URL: https://issues.apache.org/jira/browse/MESOS-4753 > Project: Mesos > Issue Type: Improvement > Components: slave, statistics >Reporter: Fan Du >Assignee: Fan Du >Priority: Minor
[jira] [Commented] (MESOS-4754) The "executors" field is exposed under a backwards incompatible schema.
[ https://issues.apache.org/jira/browse/MESOS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162684#comment-15162684 ] Michael Park commented on MESOS-4754: - The issue here is that even though {{src/common/http.cpp}} has a definition of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}}, its declaration is missing from {{src/common/http.hpp}}. This should have caused a compiler error, but it did not because we provide a generic {{json}} function for protobuf messages: {{inline void json(ObjectWriter* writer, const google::protobuf::Message& message)}}, which can jsonify {{ExecutorInfo}} using the protobuf schema. The resolution will be the following: 1. Add the missing declaration of {{void json(JSON::ObjectWriter* writer, const ExecutorInfo& executorInfo);}} to {{src/common/http.hpp}} 2. Make the generic {{json}} function that handles protobuf messages require explicit opt-in. {code} -writer->field("cgroup_info", status.cgroup_info()); +writer->field("cgroup_info", JSON::Protobuf(status.cgroup_info())); {code}
[jira] [Created] (MESOS-4754) The "executors" field is exposed under a backwards incompatible schema.
Michael Park created MESOS-4754: --- Summary: The "executors" field is exposed under a backwards incompatible schema. Key: MESOS-4754 URL: https://issues.apache.org/jira/browse/MESOS-4754 Project: Mesos Issue Type: Bug Components: master Reporter: Michael Park Assignee: Michael Park Fix For: 0.27.2 In 0.26.0, the master's {{/state}} endpoint generated the following: {code} { /* ... */ "frameworks": [ { /* ... */ "executors": [ { "command": { "argv": [], "uris": [], "value": "/Users/mpark/Projects/mesos/build/opt/src/long-lived-executor" }, "executor_id": "default", "framework_id": "0ea528a9-64ba-417f-98ea-9c4b8d418db6-", "name": "Long Lived Executor (C++)", "resources": { "cpus": 0, "disk": 0, "mem": 0 }, "slave_id": "8a513678-03a1-4cb5-9279-c3c0c591f1d8-S0" } ], /* ... */ } ] /* ... */ } {code} In 0.27.1, the {{ExecutorInfo}} is mistakenly exposed in the raw protobuf schema: {code} { /* ... */ "frameworks": [ { /* ... */ "executors": [ { "command": { "shell": true, "value": "/Users/mpark/Projects/mesos/build/opt/src/long-lived-executor" }, "executor_id": { "value": "default" }, "framework_id": { "value": "368a5a49-480b-41f6-a13b-24a69c92a72e-" }, "name": "Long Lived Executor (C++)", "slave_id": "8a513678-03a1-4cb5-9279-c3c0c591f1d8-S0", "source": "cpp_long_lived_framework" } ], /* ... */ } ] /* ... */ } {code} This is a backwards incompatible API change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4747) ContainerLoggerTest.MesosContainerizerRecover cannot be executed in isolation
[ https://issues.apache.org/jira/browse/MESOS-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4747: Shepherd: Adam B > ContainerLoggerTest.MesosContainerizerRecover cannot be executed in isolation > - > > Key: MESOS-4747 > URL: https://issues.apache.org/jira/browse/MESOS-4747 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier > Fix For: 0.28.0 > > > Some cleanup of spawned processes is missing in > {{ContainerLoggerTest.MesosContainerizerRecover}}, so that when the test is > run in isolation the global teardown might find lingering processes. > {code} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from ContainerLoggerTest > [ RUN ] ContainerLoggerTest.MesosContainerizerRecover > [ OK ] ContainerLoggerTest.MesosContainerizerRecover (13 ms) > [--] 1 test from ContainerLoggerTest (13 ms total) > [--] Global test environment tear-down > ../../src/tests/environment.cpp:728: Failure > Failed > Tests completed with child processes remaining: > -+- 7112 /SOME/PATH/src/mesos/build/src/.libs/mesos-tests > --gtest_filter=ContainerLoggerTest.MesosContainerizerRecover > \--- 7130 (sh) > [==] 1 test from 1 test case ran. (23 ms total) > [ PASSED ] 1 test. > [ FAILED ] 0 tests, listed below: > 0 FAILED TESTS > {code} > Observed on OS X with clang-trunk and an unoptimized build. -- This message was sent by Atlassian JIRA (v6.3.4#6332)