[jira] [Created] (MESOS-8745) Add a `LIST_RESOURCE_PROVIDER_CONFIGS` agent API call.
Chun-Hung Hsiao created MESOS-8745: -- Summary: Add a `LIST_RESOURCE_PROVIDER_CONFIGS` agent API call. Key: MESOS-8745 URL: https://issues.apache.org/jira/browse/MESOS-8745 Project: Mesos Issue Type: Task Components: agent Reporter: Chun-Hung Hsiao Assignee: Chun-Hung Hsiao For API completeness, it would be nice if we could provide a call to list all valid resource provider configs on an agent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8744) Test `ForwardUpdateSlaveMessage` flaky.
Meng Zhu created MESOS-8744: --- Summary: Test `ForwardUpdateSlaveMessage` flaky. Key: MESOS-8744 URL: https://issues.apache.org/jira/browse/MESOS-8744 Project: Mesos Issue Type: Bug Reporter: Meng Zhu Attachments: Badrun_ForwardUpdateSlaveMessage.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8743) Fix Mesos 1.5.x upgrade doc for allocator module interface changes.
Gilbert Song created MESOS-8743: --- Summary: Fix Mesos 1.5.x upgrade doc for allocator module interface changes. Key: MESOS-8743 URL: https://issues.apache.org/jira/browse/MESOS-8743 Project: Mesos Issue Type: Documentation Components: allocation, documentation Reporter: Gilbert Song Update the 1.5.x upgrade doc for the recent allocator module API changes: * https://github.com/apache/mesos/blame/master/include/mesos/allocator/allocator.hpp#L288 * https://github.com/apache/mesos/commit/9015cd316bf6d185363cd0caf1705e2fb118ed63#diff-e8f9b112d5e3ed340294e42fa4fc0a6e -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8742) Agent resource provider config API calls should be idempotent.
Chun-Hung Hsiao created MESOS-8742: -- Summary: Agent resource provider config API calls should be idempotent. Key: MESOS-8742 URL: https://issues.apache.org/jira/browse/MESOS-8742 Project: Mesos Issue Type: Bug Reporter: Chun-Hung Hsiao Assignee: Chun-Hung Hsiao There are some issues w.r.t. using the current agent resource provider config API calls: 1. {{UPDATE_RESOURCE_PROVIDER_CONFIG}}: If the caller fails to receive the HTTP response code, there is no way to retry the operation without triggering an RP restart. 2. {{REMOVE_RESOURCE_PROVIDER_CONFIG}}: If the caller fails to receive the HTTP response code, a retry will return a 404 Not Found. But due to MESOS-7697, there is no way for the caller to know whether the 404 is due to a previous successful config removal or not. To address these issues, we should make these calls idempotent, such that they return 200 OK when the caller retries. It would be nice if {{ADD_RESOURCE_PROVIDER_CONFIG}} were also idempotent for consistency. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
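The idempotent retry behavior requested above can be sketched in a few lines. This is an illustrative model only; `ConfigStore` and the returned status codes are assumptions for the sketch, not the Mesos agent implementation:

```cpp
#include <map>
#include <string>

// Hypothetical model (not the Mesos agent code): an
// UPDATE_RESOURCE_PROVIDER_CONFIG handler that returns 200 OK when the
// requested config is already in place, so a retried call is a no-op
// instead of triggering another resource provider restart, and a
// REMOVE_RESOURCE_PROVIDER_CONFIG handler that succeeds even when the
// config is already gone.
struct ConfigStore {
  std::map<std::string, std::string> configs;  // name -> serialized config.

  // Returns an HTTP-like status: 200 if the update was already applied
  // (an idempotent retry), 202 if newly applied, 404 if the config is
  // unknown.
  int update(const std::string& name, const std::string& config) {
    auto it = configs.find(name);
    if (it == configs.end()) {
      return 404;
    }
    if (it->second == config) {
      return 200;  // Retry of an already-applied update: nothing to do.
    }
    it->second = config;
    return 202;
  }

  // Removal is idempotent: removing an absent config still returns 200,
  // so a retry after a lost response cannot be confused with a 404.
  int remove(const std::string& name) {
    configs.erase(name);
    return 200;
  }
};
```

Under this scheme a caller that lost the first response can simply resend the same call and treat 200 as confirmation.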
[jira] [Commented] (MESOS-8729) Libprocess: deadlock in process::finalize
[ https://issues.apache.org/jira/browse/MESOS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416273#comment-16416273 ] Benjamin Mahler commented on MESOS-8729: A couple of additional finalization related issues I noticed while trying to write a small test to reproduce this: (1) re-initialization after finalize is not supported and crashes (2) things like spawn() implicitly re-initialize and therefore if things like spawn get called post-finalize() they will crash (3) even without implicit initialization, a spawn will access the process manager pointer and crash (4) double-finalize() will crash the program. Before resolving this particular bug, will create an epic to track these other issues if one doesn't already exist. cc [~kaysoky] > Libprocess: deadlock in process::finalize > - > > Key: MESOS-8729 > URL: https://issues.apache.org/jira/browse/MESOS-8729 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.6.0 > Environment: The issue has been reproduced on Ubuntu 16.04, master > branch, commit `42848653b2`. >Reporter: Andrei Budnik >Priority: Major > Labels: deadlock, libprocess > Attachments: deadlock.txt > > > Since we are calling > [`libprocess::finalize()`|https://github.com/apache/mesos/blob/02ebf9986ab5ce883a71df72e9e3392a3e37e40e/src/slave/containerizer/mesos/io/switchboard_main.cpp#L157] > before returning from the IOSwitchboard's main function, we expect that all > http responses are going to be sent back to clients before IOSwitchboard > terminates. However, after [adding|https://reviews.apache.org/r/66147/] > `libprocess::finalize()` we have seen that IOSwitchboard might get stuck in > `libprocess::finalize()`. See attached stacktrace. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8741) `Add` to sequence will not run if it races with sequence destruction
Meng Zhu created MESOS-8741: --- Summary: `Add` to sequence will not run if it races with sequence destruction Key: MESOS-8741 URL: https://issues.apache.org/jira/browse/MESOS-8741 Project: Mesos Issue Type: Bug Reporter: Meng Zhu Assignee: Meng Zhu Adding an item to a sequence is realized by dispatching `add()` to the sequence actor. However, this could race with the sequence actor's destruction: after the dispatch but before the dispatched `add()` message gets processed by the sequence actor, if the sequence gets destroyed, a terminate message will be injected at the *head* of the message queue. This would result in the destruction of the sequence without the `add()` call ever being processed. The user would end up with a pending future whose `onDiscarded` would not be triggered during the sequence destruction. The solution is to set the `inject` flag to `false` so that the terminate message is enqueued at the end of the sequence actor's message queue. All `add()` messages dispatched before the destruction will then be processed before the terminate message. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
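The race and the fix described above can be modeled with a toy mailbox. This is a hedged sketch, not the libprocess implementation; `Mailbox`, `dispatch()`, and `terminate()` here are illustrative stand-ins:

```cpp
#include <deque>
#include <functional>

// Minimal model of an actor message queue: if the terminate message is
// injected at the *head* of the queue, a pending `add()` dispatch is
// never processed; enqueuing terminate at the *tail* (inject = false)
// lets all earlier dispatches run first.
struct Mailbox {
  std::deque<std::function<void()>> queue;
  bool terminated = false;

  // Models dispatching a message (e.g. `add()`) to the actor.
  void dispatch(std::function<void()> f) { queue.push_back(std::move(f)); }

  // Models terminating the actor, with or without injection at the head.
  void terminate(bool inject) {
    auto stop = [this]() { terminated = true; };
    if (inject) {
      queue.push_front(stop);  // Jumps ahead of any pending dispatch.
    } else {
      queue.push_back(stop);   // Runs after all earlier dispatches.
    }
  }

  // Processes messages in order until the actor is terminated.
  void run() {
    while (!terminated && !queue.empty()) {
      auto f = std::move(queue.front());
      queue.pop_front();
      f();
    }
  }
};
```

With `inject = true` the pending dispatch is dropped on the floor (the stuck-future symptom); with `inject = false` it is processed before termination.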
[jira] [Created] (MESOS-8740) Update description of a Containerizer interface.
Andrei Budnik created MESOS-8740: Summary: Update description of a Containerizer interface. Key: MESOS-8740 URL: https://issues.apache.org/jira/browse/MESOS-8740 Project: Mesos Issue Type: Documentation Reporter: Andrei Budnik [Containerizer interface|https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.hpp] must be updated with respect to the latest changes. In addition, it should clearly describe the semantics of the `wait()` and `destroy()` methods, including cases with nested containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces
[ https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415995#comment-16415995 ] ASF GitHub Bot commented on MESOS-8534: --- Github user sagar8192 closed the pull request at: https://github.com/apache/mesos/pull/263 > Allow nested containers in TaskGroups to have separate network namespaces > - > > Key: MESOS-8534 > URL: https://issues.apache.org/jira/browse/MESOS-8534 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Sagar Sadashiv Patwardhan >Assignee: Sagar Sadashiv Patwardhan >Priority: Minor > Labels: cni > Fix For: 1.6.0 > > > As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to > allow nested containers in TaskGroups to have separate namespaces. I am also > going to retain the existing functionality, where nested containers can share > namespaces with the parent/root container. > *Use case:* At Yelp, we have this application called seagull that runs > multiple tasks in parallel. It is mainly used for running tests that depend > on other containerized internal microservices. It was developed before mesos > had support for docker-executor. So, it uses a custom executor, which > directly talks to docker daemon on the host and run a bunch of service > containers along with the process where tests are executed. Resources for all > these containers are not accounted for in mesos. Clean-up of these containers > is also a headache. We have a tool called docker-reaper that automatically > reaps the orphaned containers once the executor goes away. In addition to > that, we also run a few cron jobs that clean-up any leftover containers. > We are in the process of containerizing the process that runs the tests. We > also want to delegate the responsibility of lifecycle management of docker > containers to mesos and get rid of the custom executor. 
We looked at a few > alternatives to do this and decided to go with pods because they provide > all-or-nothing (atomicity) semantics that we need for our application. But, we > cannot use pods directly because all the containers in a pod have the same > network namespace. The service discovery mechanism requires all the > containers to have separate IPs. All of our microservices bind to a > container port, so we will have port collisions unless we give separate > namespaces to all the containers in a pod. > *Proposal:* I am planning to allow nested containers to have separate > namespaces. If NetworkInfo protobuf for nested containers is not empty, then > we will assign separate mnt and network namespaces to the nested containers. > Otherwise, they will share the network and mount namespaces with the > parent/root container. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces
[ https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415994#comment-16415994 ] ASF GitHub Bot commented on MESOS-8534: --- Github user sagar8192 commented on the issue: https://github.com/apache/mesos/pull/263 Closed in favor of https://reviews.apache.org/r/65987/ > Allow nested containers in TaskGroups to have separate network namespaces > - > > Key: MESOS-8534 > URL: https://issues.apache.org/jira/browse/MESOS-8534 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Sagar Sadashiv Patwardhan >Assignee: Sagar Sadashiv Patwardhan >Priority: Minor > Labels: cni > Fix For: 1.6.0 > > > As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to > allow nested containers in TaskGroups to have separate namespaces. I am also > going to retain the existing functionality, where nested containers can share > namespaces with the parent/root container. > *Use case:* At Yelp, we have this application called seagull that runs > multiple tasks in parallel. It is mainly used for running tests that depend > on other containerized internal microservices. It was developed before mesos > had support for docker-executor. So, it uses a custom executor, which > directly talks to docker daemon on the host and run a bunch of service > containers along with the process where tests are executed. Resources for all > these containers are not accounted for in mesos. Clean-up of these containers > is also a headache. We have a tool called docker-reaper that automatically > reaps the orphaned containers once the executor goes away. In addition to > that, we also run a few cron jobs that clean-up any leftover containers. > We are in the process of containerizing the process that runs the tests. We > also want to delegate the responsibility of lifecycle management of docker > containers to mesos and get rid of the custom executor. 
We looked at a few > alternatives to do this and decided to go with pods because they provide > all-or-nothing (atomicity) semantics that we need for our application. But, we > cannot use pods directly because all the containers in a pod have the same > network namespace. The service discovery mechanism requires all the > containers to have separate IPs. All of our microservices bind to a > container port, so we will have port collisions unless we give separate > namespaces to all the containers in a pod. > *Proposal:* I am planning to allow nested containers to have separate > namespaces. If NetworkInfo protobuf for nested containers is not empty, then > we will assign separate mnt and network namespaces to the nested containers. > Otherwise, they will share the network and mount namespaces with the > parent/root container. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8739) Implement a test to check that a launched container can be killed.
Andrei Budnik created MESOS-8739: Summary: Implement a test to check that a launched container can be killed. Key: MESOS-8739 URL: https://issues.apache.org/jira/browse/MESOS-8739 Project: Mesos Issue Type: Task Reporter: Andrei Budnik This test launches a long-running task, then calls the `wait()` and `destroy()` methods of the composing containerizer. Both termination statuses must be equal. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8738) Implement a test to check that a recovered container can be killed.
Andrei Budnik created MESOS-8738: Summary: Implement a test to check that a recovered container can be killed. Key: MESOS-8738 URL: https://issues.apache.org/jira/browse/MESOS-8738 Project: Mesos Issue Type: Task Reporter: Andrei Budnik This test verifies that a recovered container can be killed via the `destroy()` method of the composing containerizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
[ https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415941#comment-16415941 ] James DeFelice edited comment on MESOS-7697 at 3/27/18 5:14 PM: [https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674] https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L2729-L2740 was (Author: jdef): [https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674] https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740 > Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions > > > Key: MESOS-7697 > URL: https://issues.apache.org/jira/browse/MESOS-7697 > Project: Mesos > Issue Type: Bug > Components: HTTP API, libprocess >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > Returning a 404 error for a condition that's a known temporary condition is > confusing from a client's perspective. A client wants to know how to recover > from various error conditions. A 404 error condition should be distinct from > a "server is not yet ready, but will be shortly" condition (which should > probably be reported as a 503 "unavailable" error). > https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593 > {code} > if (response->code == process::http::Status::NOT_FOUND) { > // This could happen if the master libprocess process has not yet set up > // HTTP routes. > LOG(WARNING) << "Received '" << response->status << "' (" ><< response->body << ") for " << call.type(); > return; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
[ https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415941#comment-16415941 ] James DeFelice edited comment on MESOS-7697 at 3/27/18 5:13 PM: [https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674] https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740 was (Author: jdef): https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674 > Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions > > > Key: MESOS-7697 > URL: https://issues.apache.org/jira/browse/MESOS-7697 > Project: Mesos > Issue Type: Bug > Components: HTTP API, libprocess >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > Returning a 404 error for a condition that's a known temporary condition is > confusing from a client's perspective. A client wants to know how to recover > from various error conditions. A 404 error condition should be distinct from > a "server is not yet ready, but will be shortly" condition (which should > probably be reported as a 503 "unavailable" error). > https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593 > {code} > if (response->code == process::http::Status::NOT_FOUND) { > // This could happen if the master libprocess process has not yet set up > // HTTP routes. > LOG(WARNING) << "Received '" << response->status << "' (" ><< response->body << ") for " << call.type(); > return; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
[ https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415941#comment-16415941 ] James DeFelice commented on MESOS-7697: --- https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674 > Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions > > > Key: MESOS-7697 > URL: https://issues.apache.org/jira/browse/MESOS-7697 > Project: Mesos > Issue Type: Bug > Components: HTTP API, libprocess >Reporter: James DeFelice >Priority: Major > Labels: mesosphere > > Returning a 404 error for a condition that's a known temporary condition is > confusing from a client's perspective. A client wants to know how to recover > from various error conditions. A 404 error condition should be distinct from > a "server is not yet ready, but will be shortly" condition (which should > probably be reported as a 503 "unavailable" error). > https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593 > {code} > if (response->code == process::http::Status::NOT_FOUND) { > // This could happen if the master libprocess process has not yet set up > // HTTP routes. > LOG(WARNING) << "Received '" << response->status << "' (" ><< response->body << ") for " << call.type(); > return; > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
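The client-side dilemma described in MESOS-7697 can be sketched with a hypothetical status classifier (not part of the Mesos scheduler library): under the current behavior a client is forced to treat 404 as retryable, whereas a 503 would make the temporary case explicit:

```cpp
// Illustrative client-side classification: decide whether an HTTP
// status from the master warrants a retry. The ambiguity arises because
// 404 can mean either "this route never existed" or "routes not yet
// installed"; reporting the latter as 503 Service Unavailable would
// remove the ambiguity.
enum class Action { Retry, Fail };

Action classify(int code) {
  switch (code) {
    case 503:  // Service Unavailable: explicitly temporary; retry.
      return Action::Retry;
    case 404:  // Ambiguous today; callers must retry and hope the
               // master's HTTP routes appear.
      return Action::Retry;
    default:
      return code >= 500 ? Action::Retry : Action::Fail;
  }
}
```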
[jira] [Created] (MESOS-8737) Implement a test to check recovery of a composing containerizer.
Andrei Budnik created MESOS-8737: Summary: Implement a test to check recovery of a composing containerizer. Key: MESOS-8737 URL: https://issues.apache.org/jira/browse/MESOS-8737 Project: Mesos Issue Type: Task Reporter: Andrei Budnik This test verifies that if a recovered container terminates, then it is cleaned up and removed from the `containers_` hash map in the composing containerizer. This test should verify the above-mentioned property using the `containers()` method and should not use the `destroy()` method. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8736) Implement a test which ensures that `wait` and `destroy` return the same result for a terminated nested container.
Andrei Budnik created MESOS-8736: Summary: Implement a test which ensures that `wait` and `destroy` return the same result for a terminated nested container. Key: MESOS-8736 URL: https://issues.apache.org/jira/browse/MESOS-8736 Project: Mesos Issue Type: Task Reporter: Andrei Budnik This test launches a nested container using a composing containerizer, then checks that calling `destroy()` after `wait()` returns the same non-empty container termination status as `wait()` does. After that, it kills the parent container and checks that both `destroy()` and `wait()` return an empty termination status. Note that this test uses only the composing containerizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8733) OversubscriptionTest.ForwardUpdateSlaveMessage is flaky
[ https://issues.apache.org/jira/browse/MESOS-8733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415841#comment-16415841 ] Benjamin Bannier commented on MESOS-8733: - This seems to be related to the agent registering multiple times in these tests, since the randomly chosen backoff factor was picked too low (similar to, e.g., the issue reported in MESOS-8613). A more detailed log of a failure shows how small backoffs cause multiple registration attempts by the agent, {noformat} [ RUN ] OversubscriptionTest.ForwardUpdateSlaveMessage I0327 17:59:30.498467 14537 cluster.cpp:172] Creating default 'local' authorizer I0327 17:59:30.517834 14564 master.cpp:463] Master f201dade-c73e-42ab-8379-fde33e1d6b29 (gru1.hw.ca1.mesosphere.com) started on 192.99.40.208:39245 I0327 17:59:30.517881 14564 master.cpp:466] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/5vjDkq/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" 
--root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/5vjDkq/master" --zk_session_timeout="10secs" I0327 17:59:30.518154 14564 master.cpp:515] Master only allowing authenticated frameworks to register I0327 17:59:30.518162 14564 master.cpp:521] Master only allowing authenticated agents to register I0327 17:59:30.518167 14564 master.cpp:527] Master only allowing authenticated HTTP frameworks to register I0327 17:59:30.518172 14564 credentials.hpp:37] Loading credentials for authentication from '/tmp/5vjDkq/credentials' I0327 17:59:30.518376 14564 master.cpp:571] Using default 'crammd5' authenticator I0327 17:59:30.518471 14564 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0327 17:59:30.518566 14564 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0327 17:59:30.518637 14564 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0327 17:59:30.518702 14564 master.cpp:652] Authorization enabled I0327 17:59:30.520488 14559 whitelist_watcher.cpp:77] No whitelist given I0327 17:59:30.520504 14568 hierarchical.cpp:175] Initialized hierarchical allocator process I0327 17:59:30.522081 14568 master.cpp:2126] Elected as the leading master! 
I0327 17:59:30.522099 14568 master.cpp:1682] Recovering from registrar I0327 17:59:30.522241 14568 registrar.cpp:347] Recovering registrar I0327 17:59:30.522748 14568 registrar.cpp:391] Successfully fetched the registry (0B) in 484864ns I0327 17:59:30.522820 14568 registrar.cpp:495] Applied 1 operations in 27646ns; attempting to update the registry I0327 17:59:30.523268 14568 registrar.cpp:552] Successfully updated the registry in 419840ns I0327 17:59:30.523350 14568 registrar.cpp:424] Successfully recovered registrar I0327 17:59:30.523630 14568 master.cpp:1796] Recovered 0 agents from the registry (170B); allowing 10mins for agents to reregister I0327 17:59:30.523769 14568 hierarchical.cpp:213] Skipping recovery of hierarchical allocator: nothing to recover W0327 17:59:30.552417 14537 process.cpp:2805] Attempted to spawn already running process files@192.99.40.208:39245 I0327 17:59:30.569679 14537 containerizer.cpp:304] Using isolation { environment_secret, filesystem/posix, network/cni, posix/cpu, posix/mem } W0327 17:59:30.577472 14537 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges I0327 17:59:30.577519 14537 provisioner.cpp:299] Using default backend 'copy' I0327 17:59:30.601388 14537 cluster.cpp:460] Creating default 'local' authorizer I0327 17:59:30.604290 14553 slave.cpp:261] Mesos agent started on (49)@192.99.40.208:39245 I0327 17:59:30.604341 14553 slave.cpp:262] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://;
[jira] [Created] (MESOS-8735) Implement recovery for resource provider manager registrar
Benjamin Bannier created MESOS-8735: --- Summary: Implement recovery for resource provider manager registrar Key: MESOS-8735 URL: https://issues.apache.org/jira/browse/MESOS-8735 Project: Mesos Issue Type: Task Components: agent, master, storage Affects Versions: 1.6.0 Reporter: Benjamin Bannier Assignee: Benjamin Bannier In order to properly persist and recover resource provider information in the resource provider manager we should
# Include a registrar in the manager, and
# Implement the missing recovery functionality in the registrar so it can return a recovered registry.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8734) Restore `WaitAfterDestroy` test to check termination status of a terminated nested container.
Andrei Budnik created MESOS-8734: Summary: Restore `WaitAfterDestroy` test to check termination status of a terminated nested container. Key: MESOS-8734 URL: https://issues.apache.org/jira/browse/MESOS-8734 Project: Mesos Issue Type: Task Reporter: Andrei Budnik It's important to check that after termination of a nested container, its termination status is available. This property is relied upon by the default executor. Right now, if we remove [this section of code|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/mesos/containerizer.cpp#L2093-L2111], no test will be broken! https://reviews.apache.org/r/65505 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8733) OversubscriptionTest.ForwardUpdateSlaveMessage is flaky
Benjamin Bannier created MESOS-8733: --- Summary: OversubscriptionTest.ForwardUpdateSlaveMessage is flaky Key: MESOS-8733 URL: https://issues.apache.org/jira/browse/MESOS-8733 Project: Mesos Issue Type: Bug Components: test Affects Versions: 1.6.0 Reporter: Benjamin Bannier Assignee: Benjamin Bannier Observed this failure in CI, {noformat} [ RUN ] OversubscriptionTest.ForwardUpdateSlaveMessage 3: I0327 10:12:04.032042 18320 cluster.cpp:172] Creating default 'local' authorizer 3: I0327 10:12:04.035696 18321 master.cpp:463] Master b5c97327-11cc-4183-82ed-75e62b71cc58 (1931c74e0c4c) started on 172.17.0.2:35020 3: I0327 10:12:04.035732 18321 master.cpp:466] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/4j65Va/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4j65Va/master" --zk_session_timeout="10secs" 3: I0327 
10:12:04.036129 18321 master.cpp:515] Master only allowing authenticated frameworks to register 3: I0327 10:12:04.036140 18321 master.cpp:521] Master only allowing authenticated agents to register 3: I0327 10:12:04.036147 18321 master.cpp:527] Master only allowing authenticated HTTP frameworks to register 3: I0327 10:12:04.036156 18321 credentials.hpp:37] Loading credentials for authentication from '/tmp/4j65Va/credentials' 3: I0327 10:12:04.036468 18321 master.cpp:571] Using default 'crammd5' authenticator 3: I0327 10:12:04.036643 18321 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' 3: I0327 10:12:04.036834 18321 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' 3: I0327 10:12:04.037005 18321 http.cpp:959] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' 3: I0327 10:12:04.037170 18321 master.cpp:652] Authorization enabled 3: I0327 10:12:04.037370 18338 whitelist_watcher.cpp:77] No whitelist given 3: I0327 10:12:04.037374 18322 hierarchical.cpp:175] Initialized hierarchical allocator process 3: I0327 10:12:04.040787 18321 master.cpp:2126] Elected as the leading master! 
3: I0327 10:12:04.040812 18321 master.cpp:1682] Recovering from registrar 3: I0327 10:12:04.040966 18342 registrar.cpp:347] Recovering registrar 3: I0327 10:12:04.041606 18330 registrar.cpp:391] Successfully fetched the registry (0B) in 590848ns 3: I0327 10:12:04.041764 18330 registrar.cpp:495] Applied 1 operations in 57052ns; attempting to update the registry 3: I0327 10:12:04.042466 18330 registrar.cpp:552] Successfully updated the registry in 638976ns 3: I0327 10:12:04.042615 18330 registrar.cpp:424] Successfully recovered registrar 3: I0327 10:12:04.043128 18339 master.cpp:1796] Recovered 0 agents from the registry (135B); allowing 10mins for agents to reregister 3: I0327 10:12:04.043151 18326 hierarchical.cpp:213] Skipping recovery of hierarchical allocator: nothing to recover 3: W0327 10:12:04.048898 18320 process.cpp:2805] Attempted to spawn already running process files@172.17.0.2:35020 3: I0327 10:12:04.050076 18320 containerizer.cpp:304] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } 3: W0327 10:12:04.050720 18320 backend.cpp:76] Failed to create 'aufs' backend: AufsBackend requires root privileges 3: W0327 10:12:04.050746 18320 backend.cpp:76] Failed to create 'bind' backend: BindBackend requires root privileges 3: I0327 10:12:04.050791 18320 provisioner.cpp:299] Using default backend 'copy' 3: I0327 10:12:04.053491 18320 cluster.cpp:460] Creating default 'local' authorizer 3: I0327 10:12:04.056531 18326 slave.cpp:261] Mesos agent started on (546)@172.17.0.2:35020 3: I0327 10:12:04.056571 18326 slave.cpp:262] Flags at startup:
[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415612#comment-16415612 ] Benno Evers commented on MESOS-1466: If I understand the issue correctly, this race seems to have been eliminated as a side-effect of introducing the `launch_executor` flag in Mesos 1.5: When the master sends the `RunTaskMessage` to the agent, it thinks that the specified executor is still running on the agent, so it will set `launch_executor = false`: {noformat} // src/master/master.cpp:3841 bool Master::isLaunchExecutor( const ExecutorID& executorId, Framework* framework, Slave* slave) const { CHECK_NOTNULL(framework); CHECK_NOTNULL(slave); if (!slave->hasExecutor(framework->id(), executorId)) { CHECK(!framework->hasExecutor(slave->id, executorId)) << "Executor '" << executorId << "' known to the framework " << *framework << " but unknown to the agent " << *slave; return true; } return false; }{noformat} On the slave, when the executor doesn't exist anymore, the task is dropped with reason `REASON_EXECUTOR_TERMINATED`: {noformat} // src/slave/slave.cpp:2881 // Master does not want to launch executor. if (executor == nullptr) { // Master wants no new executor launched and there is none running on // the agent. This could happen if the task expects some previous // tasks to launch the executor. However, the earlier task got killed // or dropped hence did not launch the executor but the master doesn't // know about it yet because the `ExitedExecutorMessage` is still in // flight. In this case, we will drop the task. // // We report TASK_DROPPED to the framework because the task was // never launched. For non-partition-aware frameworks, we report // TASK_LOST for backward compatibility. 
  mesos::TaskState taskState = TASK_DROPPED;
  if (!protobuf::frameworkHasCapability(
          frameworkInfo,
          FrameworkInfo::Capability::PARTITION_AWARE)) {
    taskState = TASK_LOST;
  }

  foreach (const TaskInfo& _task, tasks) {
    const StatusUpdate update = protobuf::createStatusUpdate(
        frameworkId,
        info.id(),
        _task.task_id(),
        taskState,
        TaskStatus::SOURCE_SLAVE,
        id::UUID::random(),
        "No executor is expected to launch and there is none running",
        TaskStatus::REASON_EXECUTOR_TERMINATED,
        executorId);

    statusUpdate(update, UPID());
  }

  // We do not send `ExitedExecutorMessage` here because the expectation
  // is that there is already one on the fly to master. If the message
  // gets dropped, we will hopefully reconcile with the master later.
  return;
}
{noformat}
> Race between executor exited event and launch task can cause overcommit of > resources > > > Key: MESOS-1466 > URL: https://issues.apache.org/jira/browse/MESOS-1466 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Vinod Kone >Priority: Major > Labels: reliability, twitter > > The following sequence of events can cause an overcommit > --> Launch task is called for a task whose executor is already running > --> Executor's resources are not accounted for on the master > --> Executor exits and the event is enqueued behind launch tasks on the master > --> Master sends the task to the slave which needs to commit for resources > for task and the (new) executor. > --> Master processes the executor exited event and re-offers the executor's > resources causing an overcommit of resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources
[ https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benno Evers reassigned MESOS-1466: -- Resolution: Fixed Assignee: Meng Zhu > Race between executor exited event and launch task can cause overcommit of > resources > > > Key: MESOS-1466 > URL: https://issues.apache.org/jira/browse/MESOS-1466 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Vinod Kone >Assignee: Meng Zhu >Priority: Major > Labels: reliability, twitter > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8732) Use composing containerizer by default in tests.
Andrei Budnik created MESOS-8732: Summary: Use composing containerizer by default in tests. Key: MESOS-8732 URL: https://issues.apache.org/jira/browse/MESOS-8732 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik If we assign "docker,mesos" to the `containerizers` flag for an agent, then `ComposingContainerizer` will be used for the many tests that do not specify the `containerizers` flag themselves. That's the goal of this task. I tried to do that by adding [`flags.containerizers = "docker,mesos";`|https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L273], but it turned out that some tests start to hang due to paused clocks, since both the Docker containerizer and the Docker library use libprocess clocks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7616) Consider supporting changes to agent's domain without full drain.
[ https://issues.apache.org/jira/browse/MESOS-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415463#comment-16415463 ] Benno Evers commented on MESOS-7616: Bookkeeping note: I've assigned the same number of story points to this and the corresponding epic MESOS-1739; please correct if this isn't the correct accounting method @[~vinodkone]. > Consider supporting changes to agent's domain without full drain. > - > > Key: MESOS-7616 > URL: https://issues.apache.org/jira/browse/MESOS-7616 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway >Assignee: Benno Evers >Priority: Major > Labels: mesosphere > Fix For: 1.5.0 > > > In the initial review chain, any change to an agent's domain requires a full > drain. This is simple and straightforward, but it makes it more difficult for > operators to opt-in to using fault domains. > We should consider allowing agents to transition from "no configured domain" > to "configured domain" without requiring an agent drain. This has some > complications, however: e.g., without an API for communicating changes in an > agent's configuration to frameworks, they might not realize that an agent's > domain has changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-8731) mesos master APIs become latent
sri krishna created MESOS-8731: -- Summary: mesos master APIs become latent Key: MESOS-8731 URL: https://issues.apache.org/jira/browse/MESOS-8731 Project: Mesos Issue Type: Bug Components: master Affects Versions: 1.5.0, 1.4.0 Reporter: sri krishna Over a period of time, one of the UI API calls to the master becomes latent. A request that normally takes less than a second takes up to 20 seconds during peak. A lot of the dev team access the UI for logs. Below are my observations.

In Mesos 0.28.1-2.0.20.ubuntu1404:

# ab -n 1000 -c 10 "http://mesos-master1.mesos.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g"
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking mesos-master1.mesos.bla.net (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:
Server Hostname:        mesos-master1.mesos.bla.net
Server Port:            5050

Document Path:          /metrics/snapshot?jsonp=angular.callbacks._4g
Document Length:        3197 bytes

Concurrency Level:      10
Time taken for tests:   501.010 seconds
Complete requests:      1000
Failed requests:        954
   (Connect: 0, Receive: 0, Length: 954, Exceptions: 0)
Total transferred:      3304510 bytes
HTML transferred:       3195510 bytes
Requests per second:    2.00 [#/sec] (mean)
Time per request:       5010.104 [ms] (mean)
Time per request:       501.010 [ms] (mean, across all concurrent requests)
Transfer rate:          6.44 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   321 4987 286.4   5007    5508
Waiting:      321 4987 286.4   5007    5508
Total:        321 4988 286.4   5007    5508

Percentage of the requests served within a certain time (ms)
  50%   5007
  66%   5007
  75%   5008
  80%   5008
  90%   5008
  95%   5009
  98%   5010
  99%   5506
 100%   5508 (longest request)

In Mesos 1.4 and 1.5 (versions 1.4.0-2.0.1 and 1.5.0-2.0.1) the response time of these APIs is quite high.

# ab -n 1000 -c 10 "http://mesos-master3.stage.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g"
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking mesos-master3.stage.bla.net (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
^C

Server Software:
Server Hostname:        mesos-master3.stage.bla.net
Server Port:            5050

Document Path:          /metrics/snapshot?jsonp=angular.callbacks._4g
Document Length:        6596 bytes

Concurrency Level:      10
Time taken for tests:   1405.182 seconds
Complete requests:      582
Failed requests:        580
   (Connect: 0, Receive: 0, Length: 580, Exceptions: 0)
Total transferred:      3909986 bytes
HTML transferred:       3846548 bytes
Requests per second:    0.41 [#/sec] (mean)
Time per request:       24144.024 [ms] (mean)
Time per request:       2414.402 [ms] (mean, across all concurrent requests)
Transfer rate:          2.72 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0   0.0      0       0
Processing: 15284 24058 2600.7  23937   31740
Waiting:    15284 24058 2600.7  23937   31740
Total:      15284 24059 2600.7  23938   31740

Percentage of the requests served within a certain time (ms)
  50%  23938
  66%  25074
  75%  25729
  80%  26465
  90%  27605
  95%  28215
  98%  29685
  99%  30595
 100%  31740 (longest request)

I think this is causing the other APIs like "/master/slaves" and "/metrics" to become latent. At this point we are forcing a re-election of the master to bring the times down. What can I do to bring these times down? The load on the box is quite low; the load average does not cross 2 on an 8-core box. Let me know if any further info is required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8557) Default executor should allow decreasing the escalation grace period of a terminating task
[ https://issues.apache.org/jira/browse/MESOS-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415345#comment-16415345 ] Alexander Rukletsov commented on MESOS-8557:
{noformat}
commit 4e7bbe67f55fbaa560466fc1d0a2f5e5bdb6ab32
Author:     Gaston Kleiman
AuthorDate: Tue Mar 27 11:38:13 2018 +0200
Commit:     Alexander Rukletsov
CommitDate: Tue Mar 27 11:38:13 2018 +0200

    Added a reference to MESOS-8557 to the default executor.

    Review: https://reviews.apache.org/r/66235/
{noformat}
> Default executor should allow decreasing the escalation grace period of a > terminating task > -- > > Key: MESOS-8557 > URL: https://issues.apache.org/jira/browse/MESOS-8557 > Project: Mesos > Issue Type: Bug >Reporter: Gastón Kleiman >Priority: Major > Labels: default-executor, gracefulshutdown, mesosphere > > The command executor supports [decreasing the escalation grace period of a > terminating > task|https://github.com/apache/mesos/blob/c665dd6c22715fa941200020a8f7209f1f5b1ca1/src/launcher/executor.cpp#L800-L803]. > For consistency, this should also be supported by the default executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces
[ https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-8534: - Assignee: Sagar Sadashiv Patwardhan > Allow nested containers in TaskGroups to have separate network namespaces > - > > Key: MESOS-8534 > URL: https://issues.apache.org/jira/browse/MESOS-8534 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Sagar Sadashiv Patwardhan >Assignee: Sagar Sadashiv Patwardhan >Priority: Minor > Labels: cni > Fix For: 1.6.0 > > > As per the discussion with [~jieyu] and [~avinash.mesos], I am going to > allow nested containers in TaskGroups to have separate namespaces. I am also > going to retain the existing functionality, where nested containers can share > namespaces with the parent/root container. > *Use case:* At Yelp, we have this application called seagull that runs > multiple tasks in parallel. It is mainly used for running tests that depend > on other containerized internal microservices. It was developed before mesos > had support for docker-executor. So, it uses a custom executor, which > directly talks to the docker daemon on the host and runs a bunch of service > containers along with the process where tests are executed. Resources for all > these containers are not accounted for in mesos. Clean-up of these containers > is also a headache. We have a tool called docker-reaper that automatically > reaps the orphaned containers once the executor goes away. In addition to > that, we also run a few cron jobs that clean up any leftover containers. > We are in the process of containerizing the process that runs the tests. We > also want to delegate the responsibility of lifecycle management of docker > containers to mesos and get rid of the custom executor. We looked at a few > alternatives to do this and decided to go with pods because they provide > all-or-nothing (atomicity) semantics that we need for our application.
But, we > cannot use pods directly because all the containers in a pod have the same > network namespace. The service discovery mechanism requires all the > containers to have separate IPs. All of our microservices bind to the same > container port, so we will have port collisions unless we give separate > namespaces to all the containers in a pod. > *Proposal:* I am planning to allow nested containers to have separate > namespaces. If the NetworkInfo protobuf for a nested container is not empty, then > we will assign separate mnt and network namespaces to the nested container. > Otherwise, it will share the network and mount namespaces with the > parent/root container. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces
[ https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415104#comment-16415104 ] Jie Yu commented on MESOS-8534: ---
commit a741b15e889de3242e3aa7878105ab9d946f6ea2 (HEAD -> master, origin/master, origin/HEAD)
Author: Sagar Patwardhan
Date:   Mon Mar 26 21:13:17 2018 -0700

    Allowed a nested container to have a separate network namespace.

    Previously, nested containers always share the same network namespace
    as their parent. This patch allows a nested container to have a
    separate network namespace than its parent.

    Continued from https://github.com/apache/mesos/pull/263
    JIRA: MESOS-8534

    Review: https://reviews.apache.org/r/65987/

commit 020b8cbafaf70ef4b95915bf9b81200509b23a50
Author: Jie Yu
Date:   Mon Mar 26 23:28:20 2018 -0700

    Fixed createVolumeHostPath helper.

commit 77c56351e9bfabea221c6be84472e64b434b5169
Author: Jie Yu
Date:   Mon Mar 26 21:14:52 2018 -0700

    Added a helper to parse ContainerID.

    Review: https://reviews.apache.org/r/66101/

> Allow nested containers in TaskGroups to have separate network namespaces > - > > Key: MESOS-8534 > URL: https://issues.apache.org/jira/browse/MESOS-8534 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Sagar Sadashiv Patwardhan >Priority: Minor > Labels: cni > -- This message was sent by Atlassian JIRA (v7.6.3#76005)