[jira] [Created] (MESOS-8745) Add a `LIST_RESOURCE_PROVIDER_CONFIGS` agent API call.

2018-03-27 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8745:
--

 Summary: Add a `LIST_RESOURCE_PROVIDER_CONFIGS` agent API call.
 Key: MESOS-8745
 URL: https://issues.apache.org/jira/browse/MESOS-8745
 Project: Mesos
  Issue Type: Task
  Components: agent
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


For API completeness, it would be nice if we could provide a call to list all 
valid resource provider configs on an agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8744) Test `ForwardUpdateSlaveMessage` flaky.

2018-03-27 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-8744:
---

 Summary: Test `ForwardUpdateSlaveMessage` flaky.
 Key: MESOS-8744
 URL: https://issues.apache.org/jira/browse/MESOS-8744
 Project: Mesos
  Issue Type: Bug
Reporter: Meng Zhu
 Attachments: Badrun_ForwardUpdateSlaveMessage.txt





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8743) Fix Mesos 1.5.x upgrade doc for allocator module interface changes.

2018-03-27 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-8743:
---

 Summary: Fix Mesos 1.5.x upgrade doc for allocator module 
interface changes.
 Key: MESOS-8743
 URL: https://issues.apache.org/jira/browse/MESOS-8743
 Project: Mesos
  Issue Type: Documentation
  Components: allocation, documentation
Reporter: Gilbert Song


Update the 1.5.x upgrade doc for the recent allocator module API changes:
* 
https://github.com/apache/mesos/blame/master/include/mesos/allocator/allocator.hpp#L288
* 
https://github.com/apache/mesos/commit/9015cd316bf6d185363cd0caf1705e2fb118ed63#diff-e8f9b112d5e3ed340294e42fa4fc0a6e



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8742) Agent resource provider config API calls should be idempotent.

2018-03-27 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8742:
--

 Summary: Agent resource provider config API calls should be 
idempotent.
 Key: MESOS-8742
 URL: https://issues.apache.org/jira/browse/MESOS-8742
 Project: Mesos
  Issue Type: Bug
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


There are some issues with using the current agent resource provider config 
API calls:

1. {{UPDATE_RESOURCE_PROVIDER_CONFIG}}: If the caller fails to receive the HTTP 
response code, there is no way to retry the operation without triggering an RP 
restart.
2. {{REMOVE_RESOURCE_PROVIDER_CONFIG}}: If the caller fails to receive the HTTP 
response code, a retry will return a 404 Not Found. But due to MESOS-7697, 
there is no way for the caller to know whether the 404 is due to a previous 
successful config removal or not.

To address these issues, we should make these calls idempotent, such that they 
return 200 OK when the caller retries. It would be nice if 
{{ADD_RESOURCE_PROVIDER_CONFIG}} were also idempotent, for consistency.
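
To make the desired retry semantics concrete, here is a minimal generic sketch 
(plain C++, not the Mesos agent API; {{performCall}} is a hypothetical callback 
that issues the actual HTTP request and returns the status code). Once the 
calls are idempotent, a caller that lost the response can simply repeat the 
request until it observes 200 OK:

{code}
#include <chrono>
#include <functional>
#include <thread>

// Retries an (assumed idempotent) config call until it is acknowledged
// with 200 OK. `performCall` is a placeholder for issuing the HTTP
// request; it returns the response status code.
bool callUntilAcknowledged(
    const std::function<int()>& performCall,
    int maxAttempts = 5)
{
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (performCall() == 200) {
      // With idempotent semantics, a 200 OK on a retry after a lost
      // response means the same thing as a 200 OK on the first attempt:
      // the config change has been applied.
      return true;
    }

    // Simple exponential backoff between attempts.
    std::this_thread::sleep_for(std::chrono::seconds(1 << attempt));
  }

  return false;
}
{code}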



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8729) Libprocess: deadlock in process::finalize

2018-03-27 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416273#comment-16416273
 ] 

Benjamin Mahler commented on MESOS-8729:


A couple of additional finalization-related issues I noticed while trying to 
write a small test to reproduce this:

(1) re-initialization after finalize() is not supported and crashes
(2) calls like spawn() implicitly re-initialize libprocess, so if they are made 
post-finalize() they will crash
(3) even without implicit initialization, a spawn will access the process 
manager pointer and crash
(4) a double finalize() will crash the program.

Before resolving this particular bug, I will create an epic to track these 
other issues if one doesn't already exist. cc [~kaysoky]
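
For reference, below is a minimal standalone sketch (assuming plain libprocess, 
not the actual test code) that exercises the lifecycle in question; the 
commented-out calls at the end are the ones that crash today per (1)-(4):

{code}
#include <process/process.hpp>

class DummyProcess : public process::Process<DummyProcess> {};

int main()
{
  DummyProcess process;

  process::spawn(process);      // Implicitly initializes libprocess.
  process::terminate(process);
  process::wait(process);

  process::finalize();          // Tear down libprocess.

  // Each of the following would crash today:
  //   process::initialize();   // (1) re-initialization after finalize
  //   process::spawn(process); // (2)/(3) spawn post-finalize
  //   process::finalize();     // (4) double finalize

  return 0;
}
{code}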

> Libprocess: deadlock in process::finalize
> -
>
> Key: MESOS-8729
> URL: https://issues.apache.org/jira/browse/MESOS-8729
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.6.0
> Environment: The issue has been reproduced on Ubuntu 16.04, master 
> branch, commit `42848653b2`. 
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: deadlock, libprocess
> Attachments: deadlock.txt
>
>
> Since we are calling 
> [`libprocess::finalize()`|https://github.com/apache/mesos/blob/02ebf9986ab5ce883a71df72e9e3392a3e37e40e/src/slave/containerizer/mesos/io/switchboard_main.cpp#L157]
>  before returning from the IOSwitchboard's main function, we expect that all 
> http responses are going to be sent back to clients before IOSwitchboard 
> terminates. However, after [adding|https://reviews.apache.org/r/66147/] 
> `libprocess::finalize()` we have seen that IOSwitchboard might get stuck in 
> `libprocess::finalize()`. See attached stacktrace.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8741) `Add` to sequence will not run if it races with sequence destruction

2018-03-27 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-8741:
---

 Summary: `Add` to sequence will not run if it races with sequence 
destruction
 Key: MESOS-8741
 URL: https://issues.apache.org/jira/browse/MESOS-8741
 Project: Mesos
  Issue Type: Bug
 Environment: 


Reporter: Meng Zhu
Assignee: Meng Zhu


Adding an item to a sequence is realized by dispatching `add()` to the sequence 
actor. However, this could race with the sequence actor's destruction:

After the dispatch but before the dispatched `add()` message gets processed by 
the sequence actor, if the sequence gets destroyed, a terminate message will be 
injected at the *head* of the message queue. This would result in the 
destruction of the sequence without the `add()` call ever being processed. The 
user would end up with a pending future, and the future's `onDiscarded` would 
not be triggered during the sequence destruction.

The solution is to set the `inject` flag to `false` so that the terminate 
message is enqueued at the end of the sequence actor's message queue. All 
`add()` messages dispatched before the destruction will then be processed 
before the terminate message.
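
Below is a minimal sketch of the ordering in question (plain libprocess, not 
the actual `Sequence` implementation; `SequenceLikeProcess` is a stand-in): 
`terminate()`'s `inject` parameter controls whether the terminate message goes 
to the head (`true`) or the tail (`false`) of the target actor's queue.

{code}
#include <process/dispatch.hpp>
#include <process/process.hpp>

class SequenceLikeProcess : public process::Process<SequenceLikeProcess>
{
public:
  void add()
  {
    // The real sequence actor would chain the new item and eventually
    // satisfy the caller's future here.
  }
};

int main()
{
  SequenceLikeProcess* process = new SequenceLikeProcess();
  process::UPID pid = process::spawn(process, true); // libprocess deletes it.

  process::dispatch(pid, &SequenceLikeProcess::add);

  // With `inject = true` the terminate message would overtake the pending
  // `add()` dispatch, so `add()` would never run; with `inject = false`
  // (the proposed fix) `add()` is drained first.
  process::terminate(pid, false);
  process::wait(pid);

  return 0;
}
{code}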



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8740) Update description of a Containerizer interface.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8740:


 Summary: Update description of a Containerizer interface.
 Key: MESOS-8740
 URL: https://issues.apache.org/jira/browse/MESOS-8740
 Project: Mesos
  Issue Type: Documentation
Reporter: Andrei Budnik


The [Containerizer 
interface|https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.hpp] 
must be updated with respect to the latest changes. In addition, it should 
clearly describe the semantics of the `wait()` and `destroy()` methods, 
including cases with nested containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415995#comment-16415995
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user sagar8192 closed the pull request at:

https://github.com/apache/mesos/pull/263


> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Assignee: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
> Fix For: 1.6.0
>
>
> As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.
> *Use case:* At Yelp, we have this application called seagull that runs 
> multiple tasks in parallel. It is mainly used for running tests that depend 
> on other containerized internal microservices. It was developed before mesos 
> had support for docker-executor. So, it uses a custom executor, which 
> directly talks to the docker daemon on the host and runs a bunch of service 
> containers along with the process where tests are executed. Resources for all 
> these containers are not accounted for in mesos. Clean-up of these containers 
> is also a headache. We have a tool called docker-reaper that automatically 
> reaps the orphaned containers once the executor goes away. In addition to 
> that, we also run a few cron jobs that clean-up any leftover containers.
> We are in the process of containerizing the process that runs the tests. We 
> also want to delegate the responsibility of lifecycle management of docker 
> containers to mesos and get rid of the custom executor. We looked at a few 
> alternatives to do this and decided to go with pods because they provide 
> all-or-nothing(atomicity) semantics that we need for our application. But, we 
> cannot use pods directly because all the containers in a pod have the same 
> network namespace. The service discovery mechanism requires all the 
> containers to have separate IPs. All of our microservices bind to  
> container port, so we will have port collision unless we are giving separate 
> namespaces to all the containers in a pod.
> *Proposal:* I am planning to allow nested containers to have separate 
> namespaces. If NetworkInfo protobuf for nested containers is not empty, then 
> we will assign separate mnt and network namespaces to the nested containers. 
> Otherwise, they will share the network and mount namespaces with the 
> parent/root container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-03-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415994#comment-16415994
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user sagar8192 commented on the issue:

https://github.com/apache/mesos/pull/263
  
Closed in favor of https://reviews.apache.org/r/65987/


> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Assignee: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
> Fix For: 1.6.0
>
>
> As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.
> *Use case:* At Yelp, we have this application called seagull that runs 
> multiple tasks in parallel. It is mainly used for running tests that depend 
> on other containerized internal microservices. It was developed before mesos 
> had support for docker-executor. So, it uses a custom executor, which 
> directly talks to the docker daemon on the host and runs a bunch of service 
> containers along with the process where tests are executed. Resources for all 
> these containers are not accounted for in mesos. Clean-up of these containers 
> is also a headache. We have a tool called docker-reaper that automatically 
> reaps the orphaned containers once the executor goes away. In addition to 
> that, we also run a few cron jobs that clean-up any leftover containers.
> We are in the process of containerizing the process that runs the tests. We 
> also want to delegate the responsibility of lifecycle management of docker 
> containers to mesos and get rid of the custom executor. We looked at a few 
> alternatives to do this and decided to go with pods because they provide 
> all-or-nothing(atomicity) semantics that we need for our application. But, we 
> cannot use pods directly because all the containers in a pod have the same 
> network namespace. The service discovery mechanism requires all the 
> containers to have separate IPs. All of our microservices bind to  
> container port, so we will have port collision unless we are giving separate 
> namespaces to all the containers in a pod.
> *Proposal:* I am planning to allow nested containers to have separate 
> namespaces. If NetworkInfo protobuf for nested containers is not empty, then 
> we will assign separate mnt and network namespaces to the nested containers. 
> Otherwise, they will share the network and mount namespaces with the 
> parent/root container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8739) Implement a test to check that a launched container can be killed.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8739:


 Summary: Implement a test to check that a launched container can 
be killed.
 Key: MESOS-8739
 URL: https://issues.apache.org/jira/browse/MESOS-8739
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


This test launches a long-running task, then calls the `wait()` and `destroy()` 
methods of the composing containerizer. Both termination statuses must be equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8738) Implement a test to check that a recovered container can be killed.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8738:


 Summary: Implement a test to check that a recovered container can 
be killed.
 Key: MESOS-8738
 URL: https://issues.apache.org/jira/browse/MESOS-8738
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


This test verifies that a recovered container can be killed via the `destroy()` 
method of the composing containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions

2018-03-27 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415941#comment-16415941
 ] 

James DeFelice edited comment on MESOS-7697 at 3/27/18 5:14 PM:


[https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674]

https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L2729-L2740


was (Author: jdef):
[https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674]

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740

> Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
> 
>
> Key: MESOS-7697
> URL: https://issues.apache.org/jira/browse/MESOS-7697
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> Returning a 404 error for a condition that's a known temporary condition is 
> confusing from a client's perspective. A client wants to know how to recover 
> from various error conditions. A 404 error condition should be distinct from 
> a "server is not yet ready, but will be shortly" condition (which should 
> probably be reported as a 503 "unavailable" error).
> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593
> {code}
> if (response->code == process::http::Status::NOT_FOUND) {
>   // This could happen if the master libprocess process has not yet set up
>   // HTTP routes.
>   LOG(WARNING) << "Received '" << response->status << "' ("
><< response->body << ") for " << call.type();
>   return;
> }
> {code}
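
For illustration only, a hypothetical client-side variant of the snippet quoted 
above (not the fix proposed in this ticket, which is for the server to report 
503 rather than 404): if the master reported 503 for the "routes not set up 
yet" window, the scheduler could treat both codes as the same retryable 
condition.

{code}
if (response->code == process::http::Status::NOT_FOUND ||
    response->code == process::http::Status::SERVICE_UNAVAILABLE) {
  // The master libprocess may not have set up its HTTP routes yet
  // (404 today, ideally 503); log and let the existing retry/backoff
  // logic take over instead of surfacing an error.
  LOG(WARNING) << "Received '" << response->status << "' ("
               << response->body << ") for " << call.type();
  return;
}
{code}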



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions

2018-03-27 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415941#comment-16415941
 ] 

James DeFelice edited comment on MESOS-7697 at 3/27/18 5:13 PM:


[https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674]

https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/process.cpp#L2729-L2740


was (Author: jdef):
https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674

> Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
> 
>
> Key: MESOS-7697
> URL: https://issues.apache.org/jira/browse/MESOS-7697
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> Returning a 404 error for a condition that's a known temporary condition is 
> confusing from a client's perspective. A client wants to know how to recover 
> from various error conditions. A 404 error condition should be distinct from 
> a "server is not yet ready, but will be shortly" condition (which should 
> probably be reported as a 503 "unavailable" error).
> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593
> {code}
> if (response->code == process::http::Status::NOT_FOUND) {
>   // This could happen if the master libprocess process has not yet set up
>   // HTTP routes.
>   LOG(WARNING) << "Received '" << response->status << "' ("
><< response->body << ") for " << call.type();
>   return;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7697) Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions

2018-03-27 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415941#comment-16415941
 ] 

James DeFelice commented on MESOS-7697:
---

https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/3rdparty/libprocess/src/process.cpp#L3674

> Mesos scheduler v1 HTTP API may generate 404 errors for temporary conditions
> 
>
> Key: MESOS-7697
> URL: https://issues.apache.org/jira/browse/MESOS-7697
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, libprocess
>Reporter: James DeFelice
>Priority: Major
>  Labels: mesosphere
>
> Returning a 404 error for a condition that's a known temporary condition is 
> confusing from a client's perspective. A client wants to know how to recover 
> from various error conditions. A 404 error condition should be distinct from 
> a "server is not yet ready, but will be shortly" condition (which should 
> probably be reported as a 503 "unavailable" error).
> https://github.com/apache/mesos/blob/72752fc6deb8ebcbfbd5448dc599ef3774339d31/src/scheduler/scheduler.cpp#L593
> {code}
> if (response->code == process::http::Status::NOT_FOUND) {
>   // This could happen if the master libprocess process has not yet set up
>   // HTTP routes.
>   LOG(WARNING) << "Received '" << response->status << "' ("
><< response->body << ") for " << call.type();
>   return;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8737) Implement a test to check recovery of a composing containerizer.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8737:


 Summary: Implement a test to check recovery of a composing 
containerizer.
 Key: MESOS-8737
 URL: https://issues.apache.org/jira/browse/MESOS-8737
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


This test verifies that if a recovered container terminates, then it is cleaned 
up and removed from the `containers_` hash map in the composing containerizer.
This test should verify the above-mentioned property using the `containers()` 
method and should not use the `destroy()` method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8736) Implement a test which ensures that `wait` and `destroy` return the same result for a terminated nested container.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8736:


 Summary: Implement a test which ensures that `wait` and `destroy` 
return the same result for a terminated nested container.
 Key: MESOS-8736
 URL: https://issues.apache.org/jira/browse/MESOS-8736
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


This test launches a nested container using a composing containerizer, then 
checks that calling `destroy()` after `wait()` returns the same non-empty 
container termination status as `wait()` did. After that, it kills the parent 
container and checks that both `destroy()` and `wait()` return an empty 
termination status.
Note that this test uses only the composing containerizer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8733) OversubscriptionTest.ForwardUpdateSlaveMessage is flaky

2018-03-27 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415841#comment-16415841
 ] 

Benjamin Bannier commented on MESOS-8733:
-

This seems to be related to the agent registering multiple times in these tests, 
since the randomly chosen backoff factor was picked too low (similar to, e.g., 
the issue reported in MESOS-8613).

A more detailed log of a failure shows how small backoffs cause multiple 
registration attempts by the agent:
{noformat}
[ RUN  ] OversubscriptionTest.ForwardUpdateSlaveMessage
I0327 17:59:30.498467 14537 cluster.cpp:172] Creating default 'local' authorizer
I0327 17:59:30.517834 14564 master.cpp:463] Master 
f201dade-c73e-42ab-8379-fde33e1d6b29 (gru1.hw.ca1.mesosphere.com) started on 
192.99.40.208:39245
I0327 17:59:30.517881 14564 master.cpp:466] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/5vjDkq/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/5vjDkq/master" --zk_session_timeout="10secs"
I0327 17:59:30.518154 14564 master.cpp:515] Master only allowing authenticated 
frameworks to register
I0327 17:59:30.518162 14564 master.cpp:521] Master only allowing authenticated 
agents to register
I0327 17:59:30.518167 14564 master.cpp:527] Master only allowing authenticated 
HTTP frameworks to register
I0327 17:59:30.518172 14564 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/5vjDkq/credentials'
I0327 17:59:30.518376 14564 master.cpp:571] Using default 'crammd5' 
authenticator
I0327 17:59:30.518471 14564 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0327 17:59:30.518566 14564 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0327 17:59:30.518637 14564 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0327 17:59:30.518702 14564 master.cpp:652] Authorization enabled
I0327 17:59:30.520488 14559 whitelist_watcher.cpp:77] No whitelist given
I0327 17:59:30.520504 14568 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I0327 17:59:30.522081 14568 master.cpp:2126] Elected as the leading master!
I0327 17:59:30.522099 14568 master.cpp:1682] Recovering from registrar
I0327 17:59:30.522241 14568 registrar.cpp:347] Recovering registrar
I0327 17:59:30.522748 14568 registrar.cpp:391] Successfully fetched the 
registry (0B) in 484864ns
I0327 17:59:30.522820 14568 registrar.cpp:495] Applied 1 operations in 27646ns; 
attempting to update the registry
I0327 17:59:30.523268 14568 registrar.cpp:552] Successfully updated the 
registry in 419840ns
I0327 17:59:30.523350 14568 registrar.cpp:424] Successfully recovered registrar
I0327 17:59:30.523630 14568 master.cpp:1796] Recovered 0 agents from the 
registry (170B); allowing 10mins for agents to reregister
I0327 17:59:30.523769 14568 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W0327 17:59:30.552417 14537 process.cpp:2805] Attempted to spawn already 
running process files@192.99.40.208:39245
I0327 17:59:30.569679 14537 containerizer.cpp:304] Using isolation { 
environment_secret, filesystem/posix, network/cni, posix/cpu, posix/mem }
W0327 17:59:30.577472 14537 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
I0327 17:59:30.577519 14537 provisioner.cpp:299] Using default backend 'copy'
I0327 17:59:30.601388 14537 cluster.cpp:460] Creating default 'local' authorizer
I0327 17:59:30.604290 14553 slave.cpp:261] Mesos agent started on 
(49)@192.99.40.208:39245
I0327 17:59:30.604341 14553 slave.cpp:262] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 

[jira] [Created] (MESOS-8735) Implement recovery for resource provider manager registrar

2018-03-27 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8735:
---

 Summary: Implement recovery for resource provider manager registrar
 Key: MESOS-8735
 URL: https://issues.apache.org/jira/browse/MESOS-8735
 Project: Mesos
  Issue Type: Task
  Components: agent, master, storage
Affects Versions: 1.6.0
Reporter: Benjamin Bannier
Assignee: Benjamin Bannier


In order to properly persist and recover resource provider information in the 
resource provider manager we should
 # Include a registrar in the manager, and
 # Implement missing recovery functionality in the registrar so it can return a 
recovered registry.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8734) Restore `WaitAfterDestroy` test to check termination status of a terminated nested container.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8734:


 Summary: Restore `WaitAfterDestroy` test to check termination 
status of a terminated nested container.
 Key: MESOS-8734
 URL: https://issues.apache.org/jira/browse/MESOS-8734
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


It's important to check that after termination of a nested container, its 
termination status is available. This property is used by the default executor.

Right now, if we remove [this section of 
code|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/mesos/containerizer.cpp#L2093-L2111], 
no test will break!

https://reviews.apache.org/r/65505



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8733) OversubscriptionTest.ForwardUpdateSlaveMessage is flaky

2018-03-27 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8733:
---

 Summary: OversubscriptionTest.ForwardUpdateSlaveMessage is flaky
 Key: MESOS-8733
 URL: https://issues.apache.org/jira/browse/MESOS-8733
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.6.0
Reporter: Benjamin Bannier
Assignee: Benjamin Bannier


Observed this failure in CI,
{noformat}
[ RUN ] OversubscriptionTest.ForwardUpdateSlaveMessage
3: I0327 10:12:04.032042 18320 cluster.cpp:172] Creating default 'local' 
authorizer
3: I0327 10:12:04.035696 18321 master.cpp:463] Master 
b5c97327-11cc-4183-82ed-75e62b71cc58 (1931c74e0c4c) started on 172.17.0.2:35020
3: I0327 10:12:04.035732 18321 master.cpp:466] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/4j65Va/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/4j65Va/master" --zk_session_timeout="10secs"
3: I0327 10:12:04.036129 18321 master.cpp:515] Master only allowing 
authenticated frameworks to register
3: I0327 10:12:04.036140 18321 master.cpp:521] Master only allowing 
authenticated agents to register
3: I0327 10:12:04.036147 18321 master.cpp:527] Master only allowing 
authenticated HTTP frameworks to register
3: I0327 10:12:04.036156 18321 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/4j65Va/credentials'
3: I0327 10:12:04.036468 18321 master.cpp:571] Using default 'crammd5' 
authenticator
3: I0327 10:12:04.036643 18321 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
3: I0327 10:12:04.036834 18321 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
3: I0327 10:12:04.037005 18321 http.cpp:959] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
3: I0327 10:12:04.037170 18321 master.cpp:652] Authorization enabled
3: I0327 10:12:04.037370 18338 whitelist_watcher.cpp:77] No whitelist given
3: I0327 10:12:04.037374 18322 hierarchical.cpp:175] Initialized hierarchical 
allocator process
3: I0327 10:12:04.040787 18321 master.cpp:2126] Elected as the leading master!
3: I0327 10:12:04.040812 18321 master.cpp:1682] Recovering from registrar
3: I0327 10:12:04.040966 18342 registrar.cpp:347] Recovering registrar
3: I0327 10:12:04.041606 18330 registrar.cpp:391] Successfully fetched the 
registry (0B) in 590848ns
3: I0327 10:12:04.041764 18330 registrar.cpp:495] Applied 1 operations in 
57052ns; attempting to update the registry
3: I0327 10:12:04.042466 18330 registrar.cpp:552] Successfully updated the 
registry in 638976ns
3: I0327 10:12:04.042615 18330 registrar.cpp:424] Successfully recovered 
registrar
3: I0327 10:12:04.043128 18339 master.cpp:1796] Recovered 0 agents from the 
registry (135B); allowing 10mins for agents to reregister
3: I0327 10:12:04.043151 18326 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
3: W0327 10:12:04.048898 18320 process.cpp:2805] Attempted to spawn already 
running process files@172.17.0.2:35020
3: I0327 10:12:04.050076 18320 containerizer.cpp:304] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
3: W0327 10:12:04.050720 18320 backend.cpp:76] Failed to create 'aufs' backend: 
AufsBackend requires root privileges
3: W0327 10:12:04.050746 18320 backend.cpp:76] Failed to create 'bind' backend: 
BindBackend requires root privileges
3: I0327 10:12:04.050791 18320 provisioner.cpp:299] Using default backend 'copy'
3: I0327 10:12:04.053491 18320 cluster.cpp:460] Creating default 'local' 
authorizer
3: I0327 10:12:04.056531 18326 slave.cpp:261] Mesos agent started on 
(546)@172.17.0.2:35020
3: I0327 10:12:04.056571 18326 slave.cpp:262] Flags at startup: 

[jira] [Commented] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2018-03-27 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415612#comment-16415612
 ] 

Benno Evers commented on MESOS-1466:


If I understand the issue correctly, this race seems to have been eliminated as 
a side-effect of introducing the `launch_executor` flag in Mesos 1.5:

When the master sends the `RunTaskMessage` to the agent, it thinks that the 
specified executor is still running on the agent, so it will set 
`launch_executor = false`:
{noformat}
// src/master/master.cpp:3841
bool Master::isLaunchExecutor(
    const ExecutorID& executorId,
    Framework* framework,
    Slave* slave) const
{
  CHECK_NOTNULL(framework);
  CHECK_NOTNULL(slave);

  if (!slave->hasExecutor(framework->id(), executorId)) {
    CHECK(!framework->hasExecutor(slave->id, executorId))
      << "Executor '" << executorId
      << "' known to the framework " << *framework
      << " but unknown to the agent " << *slave;

    return true;
  }

  return false;
}{noformat}
On the slave, when the executor doesn't exist anymore, the task is dropped with 
reason `REASON_EXECUTOR_TERMINATED`:
{noformat}
// src/slave/slave.cpp:2881

    // Master does not want to launch executor.
    if (executor == nullptr) {
      // Master wants no new executor launched and there is none running on
      // the agent. This could happen if the task expects some previous
      // tasks to launch the executor. However, the earlier task got killed
      // or dropped hence did not launch the executor but the master doesn't
      // know about it yet because the `ExitedExecutorMessage` is still in
      // flight. In this case, we will drop the task.
      //
      // We report TASK_DROPPED to the framework because the task was
      // never launched. For non-partition-aware frameworks, we report
      // TASK_LOST for backward compatibility.
      mesos::TaskState taskState = TASK_DROPPED;
      if (!protobuf::frameworkHasCapability(
              frameworkInfo, FrameworkInfo::Capability::PARTITION_AWARE)) {
        taskState = TASK_LOST;
      }

      foreach (const TaskInfo& _task, tasks) {
        const StatusUpdate update = protobuf::createStatusUpdate(
            frameworkId,
            info.id(),
            _task.task_id(),
            taskState,
            TaskStatus::SOURCE_SLAVE,
            id::UUID::random(),
            "No executor is expected to launch and there is none running",
            TaskStatus::REASON_EXECUTOR_TERMINATED,
            executorId);

        statusUpdate(update, UPID());
      }

      // We do not send `ExitedExecutorMessage` here because the expectation
      // is that there is already one on the fly to master. If the message
      // gets dropped, we will hopefully reconcile with the master later.

      return;
    }{noformat}

> Race between executor exited event and launch task can cause overcommit of 
> resources
> 
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Vinod Kone
>Priority: Major
>  Labels: reliability, twitter
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave which needs to commit for resources 
> for task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-1466) Race between executor exited event and launch task can cause overcommit of resources

2018-03-27 Thread Benno Evers (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benno Evers reassigned MESOS-1466:
--

Resolution: Fixed
  Assignee: Meng Zhu

> Race between executor exited event and launch task can cause overcommit of 
> resources
> 
>
> Key: MESOS-1466
> URL: https://issues.apache.org/jira/browse/MESOS-1466
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Vinod Kone
>Assignee: Meng Zhu
>Priority: Major
>  Labels: reliability, twitter
>
> The following sequence of events can cause an overcommit
> --> Launch task is called for a task whose executor is already running
> --> Executor's resources are not accounted for on the master
> --> Executor exits and the event is enqueued behind launch tasks on the master
> --> Master sends the task to the slave which needs to commit for resources 
> for task and the (new) executor.
> --> Master processes the executor exited event and re-offers the executor's 
> resources causing an overcommit of resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8732) Use composing containerizer by default in tests.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8732:


 Summary: Use composing containerizer by default in tests.
 Key: MESOS-8732
 URL: https://issues.apache.org/jira/browse/MESOS-8732
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik


If we assign "docker,mesos" to the `containerizers` flag for an agent, then 
`ComposingContainerizer` will be used for the many tests that do not specify the 
`containerizers` flag. That's the goal of this task.

I tried to do that by adding [`flags.containerizers = 
"docker,mesos";`|https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L273], 
but it turned out that some tests started to hang due to paused clocks, while 
the docker containerizer and the docker library use libprocess clocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7616) Consider supporting changes to agent's domain without full drain.

2018-03-27 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415463#comment-16415463
 ] 

Benno Evers commented on MESOS-7616:


Bookkeeping note: I've assigned the same number of story points to this and the 
corresponding epic MESOS-1739; please correct this if it isn't the right 
accounting method, [~vinodkone].

> Consider supporting changes to agent's domain without full drain.
> -
>
> Key: MESOS-7616
> URL: https://issues.apache.org/jira/browse/MESOS-7616
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>Assignee: Benno Evers
>Priority: Major
>  Labels: mesosphere
> Fix For: 1.5.0
>
>
> In the initial review chain, any change to an agent's domain requires a full 
> drain. This is simple and straightforward, but it makes it more difficult for 
> operators to opt-in to using fault domains.
> We should consider allowing agents to transition from "no configured domain" 
> to "configured domain" without requiring an agent drain. This has some 
> complications, however: e.g., without an API for communicating changes in an 
> agent's configuration to frameworks, they might not realize that an agent's 
> domain has changed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8731) mesos master APIs become latent

2018-03-27 Thread sri krishna (JIRA)
sri krishna created MESOS-8731:
--

 Summary: mesos master APIs become latent
 Key: MESOS-8731
 URL: https://issues.apache.org/jira/browse/MESOS-8731
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.5.0, 1.4.0
Reporter: sri krishna


Over a period of time, one of the UI API calls to the master becomes latent. 
A request that normally takes less than a second takes up to 20 seconds 
during peak. A lot of the dev team accesses the UI for logs.

Below are my observations:

In Mesos "0.28.1-2.0.20.ubuntu1404":


# ab -n 1000 -c 10 
"http://mesos-master1.mesos.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g;
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking mesos-master1.mesos.bla.net (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests


Server Software:
Server Hostname: mesos-master1.mesos.bla.net
Server Port: 5050

Document Path: /metrics/snapshot?jsonp=angular.callbacks._4g
Document Length: 3197 bytes

Concurrency Level: 10
Time taken for tests: 501.010 seconds
Complete requests: 1000
Failed requests: 954
 (Connect: 0, Receive: 0, Length: 954, Exceptions: 0)
Total transferred: 3304510 bytes
HTML transferred: 3195510 bytes
Requests per second: 2.00 [#/sec] (mean)
Time per request: 5010.104 [ms] (mean)
Time per request: 501.010 [ms] (mean, across all concurrent requests)
Transfer rate: 6.44 [Kbytes/sec] received

Connection Times (ms)
 min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 321 4987 286.4 5007 5508
Waiting: 321 4987 286.4 5007 5508
Total: 321 4988 286.4 5007 5508

Percentage of the requests served within a certain time (ms)
 50% 5007
 66% 5007
 75% 5008
 80% 5008
 90% 5008
 95% 5009
 98% 5010
 99% 5506
 100% 5508 (longest request)



 

In Mesos 1.4 and 1.5 (versions 1.4.0-2.0.1 and 1.5.0-2.0.1), the response times 
of these APIs are quite high.



# ab -n 1000 -c 10 
"http://mesos-master3.stage.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g;
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking mesos-master3.stage.bla.net (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
^C

Server Software:
Server Hostname: mesos-master3.stage.bla.net
Server Port: 5050

Document Path: /metrics/snapshot?jsonp=angular.callbacks._4g
Document Length: 6596 bytes

Concurrency Level: 10
Time taken for tests: 1405.182 seconds
Complete requests: 582
Failed requests: 580
 (Connect: 0, Receive: 0, Length: 580, Exceptions: 0)
Total transferred: 3909986 bytes
HTML transferred: 3846548 bytes
Requests per second: 0.41 [#/sec] (mean)
Time per request: 24144.024 [ms] (mean)
Time per request: 2414.402 [ms] (mean, across all concurrent requests)
Transfer rate: 2.72 [Kbytes/sec] received

Connection Times (ms)
 min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 15284 24058 2600.7 23937 31740
Waiting: 15284 24058 2600.7 23937 31740
Total: 15284 24059 2600.7 23938 31740

Percentage of the requests served within a certain time (ms)
 50% 23938
 66% 25074
 75% 25729
 80% 26465
 90% 27605
 95% 28215
 98% 29685
 99% 30595
 100% 31740 (longest request)



I think this is causing the other APIs, like "/master/slaves" and "/metrics", to 
become latent.

At this point we are forcing a re-election of the master to bring the times 
down. What can I do to bring these times down? The load on the box is quite 
low; the load average does not cross 2 on an 8-core box.

Let me know if any further info is required.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8557) Default executor should allow decreasing the escalation grace period of a terminating task

2018-03-27 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415345#comment-16415345
 ] 

Alexander Rukletsov commented on MESOS-8557:


{noformat}
commit 4e7bbe67f55fbaa560466fc1d0a2f5e5bdb6ab32
Author: Gaston Kleiman 
AuthorDate: Tue Mar 27 11:38:13 2018 +0200
Commit: Alexander Rukletsov 
CommitDate: Tue Mar 27 11:38:13 2018 +0200

Added a reference to MESOS-8557 to the default executor.

Review: https://reviews.apache.org/r/66235/
{noformat}

> Default executor should allow decreasing the escalation grace period of a 
> terminating task
> --
>
> Key: MESOS-8557
> URL: https://issues.apache.org/jira/browse/MESOS-8557
> Project: Mesos
>  Issue Type: Bug
>Reporter: Gastón Kleiman
>Priority: Major
>  Labels: default-executor, gracefulshutdown, mesosphere
>
> The command executor supports [decreasing the escalation grace period of a 
> terminating 
> task|https://github.com/apache/mesos/blob/c665dd6c22715fa941200020a8f7209f1f5b1ca1/src/launcher/executor.cpp#L800-L803].
> For consistency, this should also be supported by the default executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-03-27 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu reassigned MESOS-8534:
-

Assignee: Sagar Sadashiv Patwardhan

> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Assignee: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
> Fix For: 1.6.0
>
>
> As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.
> *Use case:* At Yelp, we have this application called seagull that runs 
> multiple tasks in parallel. It is mainly used for running tests that depend 
> on other containerized internal microservices. It was developed before mesos 
> had support for docker-executor. So, it uses a custom executor, which 
> directly talks to the docker daemon on the host and runs a bunch of service 
> containers along with the process where tests are executed. Resources for all 
> these containers are not accounted for in mesos. Clean-up of these containers 
> is also a headache. We have a tool called docker-reaper that automatically 
> reaps the orphaned containers once the executor goes away. In addition to 
> that, we also run a few cron jobs that clean-up any leftover containers.
> We are in the process of containerizing the process that runs the tests. We 
> also want to delegate the responsibility of lifecycle management of docker 
> containers to mesos and get rid of the custom executor. We looked at a few 
> alternatives to do this and decided to go with pods because they provide 
> all-or-nothing(atomicity) semantics that we need for our application. But, we 
> cannot use pods directly because all the containers in a pod have the same 
> network namespace. The service discovery mechanism requires all the 
> containers to have separate IPs. All of our microservices bind to  
> container port, so we will have port collision unless we are giving separate 
> namespaces to all the containers in a pod.
> *Proposal:* I am planning to allow nested containers to have separate 
> namespaces. If NetworkInfo protobuf for nested containers is not empty, then 
> we will assign separate mnt and network namespaces to the nested containers. 
> Otherwise, they will share the network and mount namespaces with the 
> parent/root container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-03-27 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415104#comment-16415104
 ] 

Jie Yu commented on MESOS-8534:
---

commit a741b15e889de3242e3aa7878105ab9d946f6ea2 (HEAD -> master, origin/master, 
origin/HEAD)
Author: Sagar Patwardhan 
Date:   Mon Mar 26 21:13:17 2018 -0700

Allowed a nested container to have a separate network namespace.

Previously, nested containers always share the same network namespace as
their parent. This patch allows a nested container to have a separate
network namespace than its parent.

Continued from https://github.com/apache/mesos/pull/263

JIRA: MESOS-8534

Review: https://reviews.apache.org/r/65987/

commit 020b8cbafaf70ef4b95915bf9b81200509b23a50
Author: Jie Yu 
Date:   Mon Mar 26 23:28:20 2018 -0700

Fixed createVolumeHostPath helper.

commit 77c56351e9bfabea221c6be84472e64b434b5169
Author: Jie Yu 
Date:   Mon Mar 26 21:14:52 2018 -0700

Added a helper to parse ContainerID.

Review: https://reviews.apache.org/r/66101/

> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
>
> As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.
> *Use case:* At Yelp, we have this application called seagull that runs 
> multiple tasks in parallel. It is mainly used for running tests that depend 
> on other containerized internal microservices. It was developed before mesos 
> had support for docker-executor. So, it uses a custom executor, which 
> directly talks to the docker daemon on the host and runs a bunch of service 
> containers along with the process where tests are executed. Resources for all 
> these containers are not accounted for in mesos. Clean-up of these containers 
> is also a headache. We have a tool called docker-reaper that automatically 
> reaps the orphaned containers once the executor goes away. In addition to 
> that, we also run a few cron jobs that clean-up any leftover containers.
> We are in the process of containerizing the process that runs the tests. We 
> also want to delegate the responsibility of lifecycle management of docker 
> containers to mesos and get rid of the custom executor. We looked at a few 
> alternatives to do this and decided to go with pods because they provide 
> all-or-nothing(atomicity) semantics that we need for our application. But, we 
> cannot use pods directly because all the containers in a pod have the same 
> network namespace. The service discovery mechanism requires all the 
> containers to have separate IPs. All of our microservices bind to  
> container port, so we will have port collision unless we are giving separate 
> namespaces to all the containers in a pod.
> *Proposal:* I am planning to allow nested containers to have separate 
> namespaces. If NetworkInfo protobuf for nested containers is not empty, then 
> we will assign separate mnt and network namespaces to the nested containers. 
> Otherwise, they will share the network and mount namespaces with the 
> parent/root container.
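
To make the opt-in concrete, here is a small hypothetical sketch (standard 
Mesos protobuf accessors; the network name is a placeholder) of how a framework 
might give one task in a task group its own network namespace under this change:

{code}
#include <mesos/mesos.pb.h>

// Hypothetical helper: attaching a non-empty NetworkInfo to a nested
// container's ContainerInfo opts it into separate mnt and network
// namespaces; a task without any NetworkInfo keeps sharing namespaces
// with its parent/root container, as before.
void requestSeparateNetworkNamespace(mesos::TaskInfo* task)
{
  mesos::ContainerInfo* container = task->mutable_container();
  container->set_type(mesos::ContainerInfo::MESOS);

  mesos::NetworkInfo* network = container->add_network_infos();
  network->set_name("nested-cni-network"); // Placeholder CNI network name.
}
{code}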



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)