[jira] [Updated] (MESOS-6170) Health check grace period covers failures happening after first success.

2016-09-29 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6170:
---
Sprint: Mesosphere Sprint 44
Issue Type: Bug  (was: Improvement)

> Health check grace period covers failures happening after first success.
> 
>
> Key: MESOS-6170
> URL: https://issues.apache.org/jira/browse/MESOS-6170
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> Currently, the health check library [ignores *all* 
> failures|https://github.com/apache/mesos/blob/b053572bc424478cafcd60d1bce078f5132c4590/src/health-check/health_checker.cpp#L192-L197]
>  from the task’s start (technically from the health check library 
> initialization) [until after the grace period 
> ends|https://github.com/apache/mesos/blob/b053572bc424478cafcd60d1bce078f5132c4590/include/mesos/v1/mesos.proto#L403].
> This behaviour is misleading. Once the health check succeeds for the first 
> time, the grace period rule for failures should no longer apply.
> For example, if the grace period is set to 10 minutes and the task becomes 
> healthy after 1 minute but fails after 2 minutes, the failure should be 
> treated as a normal failure, with all the consequences.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6142) Frameworks may RESERVE for an arbitrary role.

2016-09-29 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6142:
---
Shepherd: Alexander Rukletsov

> Frameworks may RESERVE for an arbitrary role.
> -
>
> Key: MESOS-6142
> URL: https://issues.apache.org/jira/browse/MESOS-6142
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>Priority: Blocker
>  Labels: mesosphere, reservations
> Fix For: 1.1.0
>
>
> The master does not validate that resources from a reservation request have 
> the same role the framework is registered with. As a result, frameworks may 
> reserve resources for arbitrary roles.
> I've modified the role in [the {{ReserveThenUnreserve}} 
> test|https://github.com/apache/mesos/blob/bca600cf5602ed8227d91af9f73d689da14ad786/src/tests/reservation_tests.cpp#L117]
>  to "yoyo" and observed the following in the test's log:
> {noformat}
> I0908 18:35:43.379122 2138112 master.cpp:3362] Processing ACCEPT call for 
> offers: [ dfaf67e6-7c1c-4988-b427-c49842cb7bb7-O0 ] on agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train) for framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- 
> (default) at 
> scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116
> I0908 18:35:43.379170 2138112 master.cpp:3022] Authorizing principal 
> 'test-principal' to reserve resources 'cpus(yoyo, test-principal):1; 
> mem(yoyo, test-principal):512'
> I0908 18:35:43.379678 2138112 master.cpp:3642] Applying RESERVE operation for 
> resources cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 from 
> framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- (default) at 
> scheduler-ca12a660-9f08-49de-be4e-d452aa3aa6da@10.200.181.237:60116 to agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train)
> I0908 18:35:43.379767 2138112 master.cpp:7341] Sending checkpointed resources 
> cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512 to agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 at slave(1)@10.200.181.237:60116 
> (alexr.railnet.train)
> I0908 18:35:43.380273 3211264 slave.cpp:2497] Updated checkpointed resources 
> from  to cpus(yoyo, test-principal):1; mem(yoyo, test-principal):512
> I0908 18:35:43.380574 2674688 hierarchical.cpp:760] Updated allocation of 
> framework dfaf67e6-7c1c-4988-b427-c49842cb7bb7- on agent 
> dfaf67e6-7c1c-4988-b427-c49842cb7bb7-S0 from cpus(*):1; mem(*):512; 
> disk(*):470841; ports(*):[31000-32000] to ports(*):[31000-32000]; cpus(yoyo, 
> test-principal):1; disk(*):470841; mem(yoyo, test-principal):512 with RESERVE 
> operation
> {noformat}





[jira] [Updated] (MESOS-6157) ContainerInfo is not validated.

2016-09-29 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6157:
---
Priority: Blocker  (was: Major)

> ContainerInfo is not validated.
> ---
>
> Key: MESOS-6157
> URL: https://issues.apache.org/jira/browse/MESOS-6157
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: containerizer, mesos-containerizer, mesosphere
> Fix For: 1.1.0
>
>
> Currently, Mesos does not validate the {{ContainerInfo}} provided with 
> {{TaskInfo}} or {{ExecutorInfo}}; hence, invalid task configurations can be 
> accepted.





[jira] [Updated] (MESOS-6119) TCP health checks are not portable.

2016-09-29 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6119:
---
 Priority: Blocker  (was: Major)
Fix Version/s: 1.1.0

> TCP health checks are not portable.
> ---
>
> Key: MESOS-6119
> URL: https://issues.apache.org/jira/browse/MESOS-6119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is 
> undesirable. We should implement a portable solution for TCP health checks.





[jira] [Updated] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.

2016-09-29 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6184:
---
Priority: Blocker  (was: Major)

> Health checks should use a general mechanism to enter namespaces of the task.
> -
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Blocker
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom clone to 
> implement this:
> {code}
>   return process::defaultClone([=]() -> int {
> if (taskPid.isSome()) {
>   foreach (const string& ns, namespaces) {
> Try<Nothing> setns = ns::setns(taskPid.get(), ns);
> if (setns.isError()) {
>   ...
> }
>   }
> }
> return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.





[jira] [Updated] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.

2016-09-29 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6184:
---
Sprint: Mesosphere Sprint 44

> Health checks should use a general mechanism to enter namespaces of the task.
> -
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Blocker
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom clone to 
> implement this:
> {code}
>   return process::defaultClone([=]() -> int {
> if (taskPid.isSome()) {
>   foreach (const string& ns, namespaces) {
> Try<Nothing> setns = ns::setns(taskPid.get(), ns);
> if (setns.isError()) {
>   ...
> }
>   }
> }
> return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.





[jira] [Updated] (MESOS-5320) SSL related error messages can be misguiding or incomplete

2016-09-29 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5320:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 44
Story Points: 3

> SSL related error messages can be misguiding or incomplete
> --
>
> Key: MESOS-5320
> URL: https://issues.apache.org/jira/browse/MESOS-5320
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>  Labels: ssl
>
> I was trying to activate SSL within Mesos but had rendered an invalid 
> certificate; it was signed with a mismatching key. Once I started the master, 
> the error message I received was rather confusing:
> {noformat}
> W0503 10:15:58.027343  6696 openssl.cpp:363] Failed SSL connections will be 
> downgraded to a non-SSL socket
> Could not load key file
> {noformat} 
> To me, this error message hinted that the key file did not exist or had 
> permission issues. However, a quick {{strace}} revealed that the key file was 
> properly accessed; there was no sign of a file-not-found error or the like.
> The problem here is the hardcoded error message, which does not take 
> OpenSSL's human-readable error strings into account.
> The code that misguided me is located at 
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/openssl.cpp#L471
> We might want to change
> {noformat}
>   // Set private key.
>   if (SSL_CTX_use_PrivateKey_file(
>   ctx,
>   ssl_flags->key_file.get().c_str(),
>   SSL_FILETYPE_PEM) != 1) {
> EXIT(EXIT_FAILURE) << "Could not load key file";
>   }
> {noformat}
> Towards something like this
> {noformat}
>   // Set private key.
>   if (SSL_CTX_use_PrivateKey_file(
>   ctx,
>   ssl_flags->key_file.get().c_str(),
>   SSL_FILETYPE_PEM) != 1) {
> EXIT(EXIT_FAILURE) << "Could not use key file: " << 
> ERR_error_string(ERR_get_error(), NULL);
>   }
> {noformat}
> To receive a much more helpful message like this
> {noformat}
> W0503 13:18:12.551364 11572 openssl.cpp:363] Failed SSL connections will be 
> downgraded to a non-SSL socket
> Could not use key file: error:0B080074:x509 certificate 
> routines:X509_check_private_key:key values mismatch
> {noformat}
> A quick scan of the implementation within {{openssl.cpp}} suggests that there 
> are more places we might want to update with more informative error messages.





[jira] [Commented] (MESOS-6035) Add non-recursive version of cgroups::get

2016-09-29 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533412#comment-15533412
 ] 

Alexander Rukletsov commented on MESOS-6035:


{noformat}
Commit: e042aa071a77ef1922d9b1a93f6e8adf221979b3 [e042aa0]
Author: haosdent huang haosd...@gmail.com
Date: 29 September 2016 at 19:24:51 GMT+2
Committer: Alexander Rukletsov al...@apache.org

Removed the expired TODO about non-recursive version cgroups::get.

Review: https://reviews.apache.org/r/51185/
{noformat}

> Add non-recursive version of cgroups::get
> -
>
> Key: MESOS-6035
> URL: https://issues.apache.org/jira/browse/MESOS-6035
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>
> In some cases, we only need to get the top-level cgroups instead of all 
> cgroups recursively. Adding a non-recursive version could help avoid 
> traversing unnecessary paths.





[jira] [Created] (MESOS-6293) HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros.

2016-09-30 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6293:
--

 Summary: HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on 
some distros.
 Key: MESOS-6293
 URL: https://issues.apache.org/jira/browse/MESOS-6293
 Project: Mesos
  Issue Type: Bug
Reporter: Alexander Rukletsov


I see consistent failures of this test in the internal CI on *some* distros, 
specifically CentOS 6 and Ubuntu 14, 15, and 16. The source of the health 
check failure is always the same: {{curl}} cannot connect to the target:
{noformat}
Received task health update, healthy: false
W0929 17:22:05.270992  2730 health_checker.cpp:204] Health check failed 1 times 
consecutively: HTTP health check failed: curl returned exited with status 7: 
curl: (7) couldn't connect to host
I0929 17:22:05.273634 26850 slave.cpp:3609] Handling status update TASK_RUNNING 
(UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from executor(1)@172.30.2.20:58660
I0929 17:22:05.274178 26844 status_update_manager.cpp:323] Received status 
update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-
I0929 17:22:05.274226 26844 status_update_manager.cpp:377] Forwarding update 
TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to the agent
I0929 17:22:05.274314 26845 slave.cpp:4026] Forwarding the update TASK_RUNNING 
(UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to master@172.30.2.20:38955
I0929 17:22:05.274415 26845 slave.cpp:3920] Status update manager successfully 
handled status update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) 
for task aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of 
framework 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-
I0929 17:22:05.274436 26845 slave.cpp:3936] Sending acknowledgement for status 
update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to executor(1)@172.30.2.20:58660
I0929 17:22:05.274534 26849 master.cpp:5661] Status update TASK_RUNNING (UUID: 
f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from agent 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-S0 at slave(77)@172.30.2.20:38955 
(ip-172-30-2-20.mesosphere.io)
../../src/tests/health_check_tests.cpp:1398: Failure
I0929 17:22:05.274567 26849 master.cpp:5723] Forwarding status update 
TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-
Value of: statusHealth.get().healthy()
  Actual: false
  Expected: true
I0929 17:22:05.274636 26849 master.cpp:7560] Updating the state of task 
aa0792d3-8d85-4c32-bd04-56a9b552ebda of framework 
2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- (latest state: TASK_RUNNING, status 
update state: TASK_RUNNING)
I0929 17:22:05.274829 26844 sched.cpp:1025] Scheduler::statusUpdate took 43297ns
Received SHUTDOWN event
{noformat}





[jira] [Commented] (MESOS-6293) HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros.

2016-09-30 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535935#comment-15535935
 ] 

Alexander Rukletsov commented on MESOS-6293:


I ran that test 1000 times on CentOS 6 and could not reproduce the failure.

> HealthCheckTest.HealthyTaskViaHTTPWithoutType fails on some distros.
> 
>
> Key: MESOS-6293
> URL: https://issues.apache.org/jira/browse/MESOS-6293
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>  Labels: health-check, mesosphere
>
> I see consistent failures of this test in the internal CI on *some* distros, 
> specifically CentOS 6 and Ubuntu 14, 15, and 16. The source of the health 
> check failure is always the same: {{curl}} cannot connect to the target:
> {noformat}
> Received task health update, healthy: false
> W0929 17:22:05.270992  2730 health_checker.cpp:204] Health check failed 1 
> times consecutively: HTTP health check failed: curl returned exited with 
> status 7: curl: (7) couldn't connect to host
> I0929 17:22:05.273634 26850 slave.cpp:3609] Handling status update 
> TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from executor(1)@172.30.2.20:58660
> I0929 17:22:05.274178 26844 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-
> I0929 17:22:05.274226 26844 status_update_manager.cpp:377] Forwarding update 
> TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to the agent
> I0929 17:22:05.274314 26845 slave.cpp:4026] Forwarding the update 
> TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to master@172.30.2.20:38955
> I0929 17:22:05.274415 26845 slave.cpp:3920] Status update manager 
> successfully handled status update TASK_RUNNING (UUID: 
> f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-
> I0929 17:22:05.274436 26845 slave.cpp:3936] Sending acknowledgement for 
> status update TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for 
> task aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of 
> framework 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- to 
> executor(1)@172.30.2.20:58660
> I0929 17:22:05.274534 26849 master.cpp:5661] Status update TASK_RUNNING 
> (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- from agent 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-S0 at slave(77)@172.30.2.20:38955 
> (ip-172-30-2-20.mesosphere.io)
> ../../src/tests/health_check_tests.cpp:1398: Failure
> I0929 17:22:05.274567 26849 master.cpp:5723] Forwarding status update 
> TASK_RUNNING (UUID: f5408ac9-f6ba-447f-b3d7-9dce44384ffe) for task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda in health state unhealthy of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f-
> Value of: statusHealth.get().healthy()
>   Actual: false
>   Expected: true
> I0929 17:22:05.274636 26849 master.cpp:7560] Updating the state of task 
> aa0792d3-8d85-4c32-bd04-56a9b552ebda of framework 
> 2e0e9ea1-0ae5-4f28-80bb-a9abc56c5a6f- (latest state: TASK_RUNNING, status 
> update state: TASK_RUNNING)
> I0929 17:22:05.274829 26844 sched.cpp:1025] Scheduler::statusUpdate took 
> 43297ns
> Received SHUTDOWN event
> {noformat}





[jira] [Updated] (MESOS-6279) Add test cases for the TCP health check

2016-10-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6279:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 44
Story Points: 3
Target Version/s: 1.1.0
  Labels: health-check mesosphere test  (was: health-check)
 Component/s: tests

> Add test cases for the TCP health check
> ---
>
> Key: MESOS-6279
> URL: https://issues.apache.org/jira/browse/MESOS-6279
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>






[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks

2016-10-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6278:
---
Shepherd: Alexander Rukletsov
  Sprint: Mesosphere Sprint 44
Story Points: 3
Target Version/s: 1.1.0
  Labels: health-check mesosphere test  (was: health-check test)
 Component/s: tests

> Add test cases for the HTTP health checks
> -
>
> Key: MESOS-6278
> URL: https://issues.apache.org/jira/browse/MESOS-6278
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>






[jira] [Created] (MESOS-6309) Mesos-specific targets appear in libprocess' cmake config.

2016-10-04 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6309:
--

 Summary: Mesos-specific targets appear in libprocess' cmake config.
 Key: MESOS-6309
 URL: https://issues.apache.org/jira/browse/MESOS-6309
 Project: Mesos
  Issue Type: Improvement
  Components: build
Affects Versions: 1.0.0
Reporter: Alexander Rukletsov


The file 
https://github.com/apache/mesos/blob/caaa98bd6e0a3a80f10a94e320d4581883f8453c/3rdparty/libprocess/cmake/Process3rdpartyConfigure.cmake
defines some Mesos-related targets, e.g. {{MESOS_FETCHER}}, that do not belong 
in libprocess.





[jira] [Updated] (MESOS-6119) TCP health checks are not portable.

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6119:
---
 Priority: Major  (was: Blocker)
Fix Version/s: (was: 1.1.0)

> TCP health checks are not portable.
> ---
>
> Key: MESOS-6119
> URL: https://issues.apache.org/jira/browse/MESOS-6119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: health-check, mesosphere
>
> MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is 
> undesirable. We should implement a portable solution for TCP health checks.





[jira] [Updated] (MESOS-6288) The default executor should maintain launcher_dir.

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6288:
---
Fix Version/s: (was: 1.1.0)

> The default executor should maintain launcher_dir.
> --
>
> Key: MESOS-6288
> URL: https://issues.apache.org/jira/browse/MESOS-6288
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: health-check, mesosphere
>
> Both the command and docker executors require that {{launcher_dir}} is 
> provided via a flag. This directory contains Mesos binaries, e.g. the TCP 
> checker necessary for TCP health checks. The default executor should somehow 
> obtain this directory (via a flag or an environment variable) and maintain it 
> for the health checker to use.





[jira] [Updated] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6184:
---
Fix Version/s: (was: 1.1.0)

> Health checks should use a general mechanism to enter namespaces of the task.
> -
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Blocker
>  Labels: health-check, mesosphere
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom clone to 
> implement this:
> {code}
>   return process::defaultClone([=]() -> int {
> if (taskPid.isSome()) {
>   foreach (const string& ns, namespaces) {
> Try<Nothing> setns = ns::setns(taskPid.get(), ns);
> if (setns.isError()) {
>   ...
> }
>   }
> }
> return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.





[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6278:
---
Target Version/s:   (was: 1.1.0)

> Add test cases for the HTTP health checks
> -
>
> Key: MESOS-6278
> URL: https://issues.apache.org/jira/browse/MESOS-6278
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>






[jira] [Updated] (MESOS-6279) Add test cases for the TCP health check

2016-10-06 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6279:
---
Target Version/s:   (was: 1.1.0)

> Add test cases for the TCP health check
> ---
>
> Key: MESOS-6279
> URL: https://issues.apache.org/jira/browse/MESOS-6279
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>






[jira] [Updated] (MESOS-6119) TCP health checks are not portable.

2016-10-05 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6119:
---
Shepherd: Till Toenshoff

> TCP health checks are not portable.
> ---
>
> Key: MESOS-6119
> URL: https://issues.apache.org/jira/browse/MESOS-6119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is 
> undesirable. We should implement a portable solution for TCP health checks.





[jira] [Commented] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting

2016-10-07 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15554654#comment-15554654
 ] 

Alexander Rukletsov commented on MESOS-6321:


A good run should look like this:
{noformat}
[ RUN  ] HierarchicalAllocatorTest.NoDoubleAccounting
I1007 11:29:37.357229 3211264 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I1007 11:29:37.357724 1601536 hierarchical.cpp:275] Added framework framework1
I1007 11:29:37.357810 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.357842 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.357875 1601536 hierarchical.cpp:1286] Performed allocation for 0 
agents in 127us
I1007 11:29:37.358070 1601536 hierarchical.cpp:485] Added agent agent1 (agent1) 
with cpus(*):1 (allocated: cpus(*):1)
I1007 11:29:37.358151 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.358165 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.358182 1601536 hierarchical.cpp:1309] Performed allocation for 
agent agent1 in 87us
I1007 11:29:37.358243 1601536 hierarchical.cpp:485] Added agent agent2 (agent2) 
with cpus(*):1 (allocated: cpus(*):1)
I1007 11:29:37.358337 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.358361 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.358373 1601536 hierarchical.cpp:1309] Performed allocation for 
agent agent2 in 102us
I1007 11:29:37.358554 1601536 hierarchical.cpp:275] Added framework framework2
I1007 11:29:37.358619 1601536 hierarchical.cpp:1694] No allocations performed
I1007 11:29:37.358649 1601536 hierarchical.cpp:1789] No inverse offers to send 
out!
I1007 11:29:37.358662 1601536 hierarchical.cpp:1286] Performed allocation for 2 
agents in 95us
I1007 11:29:37.358786 1064960 process.cpp:3377] Handling HTTP event for process 
'metrics' with path: '/metrics/snapshot'
[   OK ] HierarchicalAllocatorTest.NoDoubleAccounting (18 ms)
{noformat}

The test failed because the allocation events are processed after the metrics 
event, meaning the metrics do not yet contain the information we are looking 
for. The fix would be to make sure the allocation events are processed 
*before* the metrics are queried.

> CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting
> -
>
> Key: MESOS-6321
> URL: https://issues.apache.org/jira/browse/MESOS-6321
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Observed in internal CI:
> {noformat}
> [15:52:21] : [Step 10/10] [ RUN  ] 
> HierarchicalAllocatorTest.NoDoubleAccounting
> [15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 
> hierarchical.cpp:275] Added framework framework1
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] 
> Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 
> hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 
> hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 
> hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 
> hierarchical.cpp:275] Added framework framework2
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 
> hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns
> 

[jira] [Updated] (MESOS-6321) CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting

2016-10-07 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6321:
---
Shepherd: Michael Park
  Sprint: Mesosphere Sprint 44
Story Points: 1
Target Version/s: 1.1.0

> CHECK failure in HierarchicalAllocatorTest.NoDoubleAccounting
> -
>
> Key: MESOS-6321
> URL: https://issues.apache.org/jira/browse/MESOS-6321
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
>
> Observed in internal CI:
> {noformat}
> [15:52:21] : [Step 10/10] [ RUN  ] 
> HierarchicalAllocatorTest.NoDoubleAccounting
> [15:52:21]W: [Step 10/10] I1006 15:52:21.813817 23713 
> hierarchical.cpp:275] Added framework framework1
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814100 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814102 23712 process.cpp:3377] 
> Handling HTTP event for process 'metrics' with path: '/metrics/snapshot'
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814121 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814146 23713 
> hierarchical.cpp:1286] Performed allocation for 0 agents in 52445ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814206 23713 
> hierarchical.cpp:485] Added agent agent1 (agent1) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814237 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814247 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814259 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent1 in 33887ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814294 23713 
> hierarchical.cpp:485] Added agent agent2 (agent2) with cpus(*):1 (allocated: 
> cpus(*):1)
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814332 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814342 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814349 23713 
> hierarchical.cpp:1309] Performed allocation for agent agent2 in 42682ns
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814417 23713 
> hierarchical.cpp:275] Added framework framework2
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814445 23713 
> hierarchical.cpp:1694] No allocations performed
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814455 23713 
> hierarchical.cpp:1789] No inverse offers to send out!
> [15:52:21]W: [Step 10/10] I1006 15:52:21.814469 23713 
> hierarchical.cpp:1286] Performed allocation for 2 agents in 37976ns
> [15:52:21]W: [Step 10/10] F1006 15:52:21.824954 23692 json.hpp:334] Check 
> failed: 'boost::get(this)' Must be non NULL
> [15:52:21]W: [Step 10/10] *** Check failure stack trace: ***
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbd71d  
> google::LogMessage::Fail()
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbf55d  
> google::LogMessage::SendToLog()
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbd30c  
> google::LogMessage::Flush()
> [15:52:21]W: [Step 10/10] @ 0x7fe953bbfe59  
> google::LogMessageFatal::~LogMessageFatal()
> [15:52:21]W: [Step 10/10] @   0x7cc903  JSON::Value::as<>()
> [15:52:21]W: [Step 10/10] @   0x8b633c  
> mesos::internal::tests::HierarchicalAllocatorTest_NoDoubleAccounting_Test::TestBody()
> [15:52:21]W: [Step 10/10] @  0x129ce23  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> [15:52:21]W: [Step 10/10] @  0x1292f07  testing::Test::Run()
> [15:52:21]W: [Step 10/10] @  0x1292fae  
> testing::TestInfo::Run()
> [15:52:21]W: [Step 10/10] @  0x12930b5  
> testing::TestCase::Run()
> [15:52:21]W: [Step 10/10] @  0x1293368  
> testing::internal::UnitTestImpl::RunAllTests()
> [15:52:21]W: [Step 10/10] @  0x1293624  
> testing::UnitTest::Run()
> [15:52:21]W: [Step 10/10] @   0x507254  main
> [15:52:21]W: [Step 10/10] @ 0x7fe95122876d  (unknown)
> [15:52:21]W: [Step 10/10] @   0x51e341  (unknown)
> [15:52:21]W: [Step 10/10] Aborted (core dumped)
> [15:52:21]W: [Step 10/10] Process exited with code 134
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6157) ContainerInfo is not validated.

2016-10-06 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15552352#comment-15552352
 ] 

Alexander Rukletsov commented on MESOS-6157:


Apparently, {{ContainerInfo}} can also be set for non-container tasks and may 
also be interpreted as an indication of which containerizer to use. I've 
reverted the validation; see https://reviews.apache.org/r/51865 for details.
{noformat}
Commit: f93f4fca57added6b0bff04a3e12699eaef13da9 [f93f4fc]
Parents: 001c55c306
Author: Alexander Rukletsov 
Date: 20 September 2016 at 14:41:15 GMT+2
Commit Date: 20 September 2016 at 16:58:19 GMT+2
Labels: alexr/container-additions-revert

Revert "Added validation for `ContainerInfo`."

This reverts commit e65f580bf0cbea64cedf521cf169b9b4c9f85454.
{noformat}

> ContainerInfo is not validated.
> ---
>
> Key: MESOS-6157
> URL: https://issues.apache.org/jira/browse/MESOS-6157
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: containerizer, mesos-containerizer, mesosphere
> Fix For: 1.1.0
>
>
> Currently Mesos does not validate {{ContainerInfo}} provided with 
> {{TaskInfo}} or {{ExecutorInfo}}, hence invalid task configurations can be 
> accepted.





[jira] [Updated] (MESOS-5770) Mesos state api reporting Host IP instead of container IP with health check

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5770:
---
Affects Version/s: 0.28.2

> Mesos state api reporting Host IP instead of container IP with health check 
> 
>
> Key: MESOS-5770
> URL: https://issues.apache.org/jira/browse/MESOS-5770
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.2
>Reporter: Lax
>Priority: Critical
> Fix For: 1.0.0
>
>
> I am using Mesos IP-per-container with the Docker containerizer (via Calico). 
> The Mesos state API (/master/state.json) reports the container IP as long as 
> I have no health check on my task. As soon as I add a health check to the 
> task, Mesos starts reporting the host IP instead of the container IP.
> I had initially opened this bug on Marathon 
> (https://github.com/mesosphere/marathon/issues/3907), but was then told the 
> issue is with Mesos reporting the wrong IP.
> Here are the versions of Mesos and Marathon I was using:
> Mesos: 0.28.0.2
> Marathon: 0.15.3-1.0





[jira] [Updated] (MESOS-5874) Only send ShutdownFrameworkMessage to agents associated with framework.

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5874:
---
Fix Version/s: (was: 1.1.0)

> Only send ShutdownFrameworkMessage to agents associated with framework.
> ---
>
> Key: MESOS-5874
> URL: https://issues.apache.org/jira/browse/MESOS-5874
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Jacob Janco
>Assignee: Jacob Janco
>Priority: Minor
>  Labels: mesosphere
>
> slave.cpp:2079] Asked to shut down framework ${framework} by master@${master}
> slave.cpp:2094] Cannot shut down unknown framework ${framework}
> On clusters with high framework churn, these messages saturate the agent 
> logs. When a framework terminates, a ShutdownFrameworkMessage is sent to 
> every registered slave in a loop. This patch proposes sending the message 
> only to agents with executors associated with the framework.
> Also proposed is moving the log line to VLOG(1).





[jira] [Commented] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction

2016-10-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15562299#comment-15562299
 ] 

Alexander Rukletsov commented on MESOS-5967:


[~klueska], [~bmahler] Retargeting this for 1.2, please speak up if you want to 
land this in 1.1.0.

> Add support for 'docker image inspect' in our docker abstraction
> 
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  





[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5967:
---
Target Version/s: 1.2.0
   Fix Version/s: (was: 1.1.0)

> Add support for 'docker image inspect' in our docker abstraction
> 
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  





[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5967:
---
Summary: Add support for 'docker image inspect' in our docker abstraction.  
(was: Add support for 'docker image inspect' in our docker abstraction)

> Add support for 'docker image inspect' in our docker abstraction.
> -
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  





[jira] [Commented] (MESOS-6006) Abstract mesos-style.py to allow future linters to be added more easily

2016-10-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15562308#comment-15562308
 ] 

Alexander Rukletsov commented on MESOS-6006:


[~kaysoky], [~klueska] what is the status here? Retargeting this for 1.2, 
please speak up if you want to land this in 1.1.0.

> Abstract mesos-style.py to allow future linters to be added more easily
> ---
>
> Key: MESOS-6006
> URL: https://issues.apache.org/jira/browse/MESOS-6006
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: cli, mesosphere
> Fix For: 1.1.0
>
>
> Currently, mesos-style.py is just a collection of functions that
> check the style of relevant files in the Mesos code base. However,
> the script assumes that we always want to run cpplint over every
> file we are checking. Since we are planning to add a Python linter
> to the codebase soon, it makes sense to abstract the common
> functionality of this script into a class so that a C++-based linter
> and a Python-based linter can inherit the same set of common
> functionality.





[jira] [Updated] (MESOS-6006) Abstract mesos-style.py to allow future linters to be added more easily.

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6006:
---
Target Version/s: 1.2.0
   Fix Version/s: (was: 1.1.0)
 Summary: Abstract mesos-style.py to allow future linters to be 
added more easily.  (was: Abstract mesos-style.py to allow future linters to be 
added more easily)

> Abstract mesos-style.py to allow future linters to be added more easily.
> 
>
> Key: MESOS-6006
> URL: https://issues.apache.org/jira/browse/MESOS-6006
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: cli, mesosphere
>
> Currently, mesos-style.py is just a collection of functions that
> check the style of relevant files in the Mesos code base. However,
> the script assumes that we always want to run cpplint over every
> file we are checking. Since we are planning to add a Python linter
> to the codebase soon, it makes sense to abstract the common
> functionality of this script into a class so that a C++-based linter
> and a Python-based linter can inherit the same set of common
> functionality.





[jira] [Updated] (MESOS-2449) Support group of tasks (Pod) constructs and API in Mesos

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-2449:
---
Fix Version/s: (was: 1.1.0)

> Support group of tasks (Pod) constructs and API in Mesos
> 
>
> Key: MESOS-2449
> URL: https://issues.apache.org/jira/browse/MESOS-2449
> Project: Mesos
>  Issue Type: Epic
>Reporter: Timothy Chen
>  Labels: mesosphere
>
> There is a common need among different frameworks to start a group of 
> tasks that either depend on or are co-located with each other.
> Although a framework can schedule individual tasks within the same offer and 
> slave id, it doesn't have a way to describe dependencies, failure policies 
> (if one of the tasks fails), network setup, group container information, 
> etc.
> This epic is meant to start the discussion around the requirements folks 
> have, and to see where we can take this.





[jira] [Commented] (MESOS-2449) Support group of tasks (Pod) constructs and API in Mesos

2016-10-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15562168#comment-15562168
 ] 

Alexander Rukletsov commented on MESOS-2449:


Removing 'fix version' since this is an epic.

> Support group of tasks (Pod) constructs and API in Mesos
> 
>
> Key: MESOS-2449
> URL: https://issues.apache.org/jira/browse/MESOS-2449
> Project: Mesos
>  Issue Type: Epic
>Reporter: Timothy Chen
>  Labels: mesosphere
>
> There is a common need among different frameworks to start a group of 
> tasks that either depend on or are co-located with each other.
> Although a framework can schedule individual tasks within the same offer and 
> slave id, it doesn't have a way to describe dependencies, failure policies 
> (if one of the tasks fails), network setup, group container information, 
> etc.
> This epic is meant to start the discussion around the requirements folks 
> have, and to see where we can take this.





[jira] [Updated] (MESOS-5902) CMake should generate protobuf definitions for Java

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5902:
---
Affects Version/s: (was: 1.1.0)

> CMake should generate protobuf definitions for Java
> ---
>
> Key: MESOS-5902
> URL: https://issues.apache.org/jira/browse/MESOS-5902
> Project: Mesos
>  Issue Type: Improvement
>  Components: build
> Environment: CMake
>Reporter: Srinivas
>Assignee: Srinivas
>
> Currently, the Java protobuf bindings require the Java protobuf library to 
> generate and compile the sources. We should build protobuf-java-2.6.1.jar 
> from the protobuf sources, just like we build the Mesos protobufs for C++.





[jira] [Commented] (MESOS-3633) Port stout/path.hpp to Windows

2016-10-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15562231#comment-15562231
 ] 

Alexander Rukletsov commented on MESOS-3633:


[~jvanremoortere], [~kaysoky], [~hausdorff] What is the status here? Shall the 
fix version be updated?

> Port stout/path.hpp to Windows
> --
>
> Key: MESOS-3633
> URL: https://issues.apache.org/jira/browse/MESOS-3633
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
> Fix For: 0.27.0
>
>






[jira] [Updated] (MESOS-4259) mesos HA can't delete the redundant container on failure slave node.

2016-10-10 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4259:
---
Target Version/s:   (was: 0.25.0)
   Fix Version/s: (was: 0.25.0)

> mesos HA can't delete the redundant container on failure slave node.
> -
>
> Key: MESOS-4259
> URL: https://issues.apache.org/jira/browse/MESOS-4259
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, framework
>Affects Versions: 0.25.0
>Reporter: wangqun
>Priority: Critical
>  Labels: patch
> Attachments: canon.pdf
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> We have set up one Mesos cluster: one master node and two slave nodes.
> We want to test HA, but we find that Mesos can't delete the redundant 
> container on the failed slave node.
> 1. We create one container on a slave node.
> 2. We stop the slave node hosting the container, and the container is 
> transferred to the remaining slave node.
> However, if we restore the slave node we stopped, the status of the 
> original container is 'exited'. If I then start the container manually, it 
> starts up, i.e. two containers are running on different slave nodes.
> I think the original container and the new container are duplicates after 
> the migration; they are redundant. Can Mesos automatically delete the 
> redundant container?





[jira] [Updated] (MESOS-6208) Containers that use the Mesos containerizer but don't want to provision a container image fail to validate.

2016-09-20 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6208:
---
 Assignee: Alexander Rukletsov
   Sprint: Mesosphere Sprint 43
 Story Points: 1
   Labels: mesosphere  (was: )
Fix Version/s: 1.1.0

> Containers that use the Mesos containerizer but don't want to provision a 
> container image fail to validate.
> ---
>
> Key: MESOS-6208
> URL: https://issues.apache.org/jira/browse/MESOS-6208
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Mesos HEAD, change was introduced with 
> e65f580bf0cbea64cedf521cf169b9b4c9f85454
>Reporter: Jan Schlicht
>Assignee: Alexander Rukletsov
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> Tasks using features like volumes or CNI in their containers have to define 
> these in {{TaskInfo.container}}. When such tasks don't need to provision a 
> container image, neither {{ContainerInfo.docker}} nor 
> {{ContainerInfo.mesos}} will be set. Nevertheless, the container type in 
> {{ContainerInfo.type}} needs to be set, because it is a required field.
> In that case, the recently introduced validation rules in 
> {{master/validation.cpp}} ({{validateContainerInfo}}) fail, which isn't 
> expected.





[jira] [Updated] (MESOS-6157) ContainerInfo is not validated.

2016-09-20 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6157:
---
Shepherd: Jie Yu
  Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43  (was: Mesosphere 
Sprint 42)
Story Points: 3  (was: 1)

> ContainerInfo is not validated.
> ---
>
> Key: MESOS-6157
> URL: https://issues.apache.org/jira/browse/MESOS-6157
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: containerizer, mesos-containerizer, mesosphere
> Fix For: 1.1.0
>
>
> Currently Mesos does not validate {{ContainerInfo}} provided with 
> {{TaskInfo}} or {{ExecutorInfo}}, hence invalid task configurations can be 
> accepted.





[jira] [Commented] (MESOS-2564) Kill superfluous forward declaration comments.

2016-09-21 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510409#comment-15510409
 ] 

Alexander Rukletsov commented on MESOS-2564:


https://reviews.apache.org/r/32608/

> Kill superfluous forward declaration comments.
> --
>
> Key: MESOS-2564
> URL: https://issues.apache.org/jira/browse/MESOS-2564
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Priority: Minor
>  Labels: easyfix, newbie
>
> We often prepend forward declarations with a comment, which is pretty 
> useless, e.g.: 
> {code}
> // Forward declarations.
> class LogStorageProcess;
> {code}
> or
> {code}
> // Forward declarations.
> namespace registry {
> class Slaves;
> }
> class Authorizer;
> class WhitelistWatcher;
> {code}
> This JIRA aims to clean up such comments.





[jira] [Updated] (MESOS-5987) Update health check protobuf for HTTP and TCP health check

2016-09-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5987:
---
Sprint: Mesosphere Sprint 40, Mesosphere Sprint 42, Mesosphere Sprint 43  
(was: Mesosphere Sprint 40, Mesosphere Sprint 42)

> Update health check protobuf for HTTP and TCP health check
> --
>
> Key: MESOS-5987
> URL: https://issues.apache.org/jira/browse/MESOS-5987
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To support HTTP and TCP health checks, we need to update the existing 
> {{HealthCheck}} protobuf message, as [~alexr] and [~gaston] commented in 
> https://reviews.apache.org/r/36816/ and https://reviews.apache.org/r/49360/.





[jira] [Updated] (MESOS-6110) Deprecate using health checks without setting the type

2016-09-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6110:
---
Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43  (was: Mesosphere Sprint 
42)

> Deprecate using health checks without setting the type
> --
>
> Key: MESOS-6110
> URL: https://issues.apache.org/jira/browse/MESOS-6110
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Silas Snider
>Assignee: haosdent
>Priority: Blocker
>  Labels: compatibility, health-check, mesosphere
>
> When launching a task using the 1.0.x protos and the legacy (non-HTTP) 
> API, tasks with a health check defined are rejected (TASK_ERROR) because 
> the 'type' field is not set.
> This field is marked optional in the proto and was not available before 
> 1.1.0, so it should not be required, in order to keep the Mesos v1 API 
> compatibility promise.
> For backwards compatibility, temporarily allow the use case where a command 
> health check is set without a type.





[jira] [Updated] (MESOS-6119) TCP health checks are not portable.

2016-09-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6119:
---
Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43  (was: Mesosphere Sprint 
42)

> TCP health checks are not portable.
> ---
>
> Key: MESOS-6119
> URL: https://issues.apache.org/jira/browse/MESOS-6119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: health-check, mesosphere
>
> MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is 
> undesirable. We should implement a portable solution for TCP health checks.





[jira] [Updated] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.

2016-09-23 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6184:
---
Summary: Health checks should use a general mechanism to enter namespaces 
of the task.  (was: Change health check to use childHooks to enter the 
namespaces of the container)

> Health checks should use a general mechanism to enter namespaces of the task.
> -
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom clone 
> to implement this:
> {code}
>   return process::defaultClone([=]() -> int {
>     if (taskPid.isSome()) {
>       foreach (const string& ns, namespaces) {
>         Try<Nothing> setns = ns::setns(taskPid.get(), ns);
>         if (setns.isError()) {
>           ...
>         }
>       }
>     }
>     return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.





[jira] [Updated] (MESOS-6184) Change health check to use childHooks to enter the namespaces of the container

2016-09-23 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6184:
---
 Shepherd: Alexander Rukletsov
 Story Points: 3
   Labels: health-check mesosphere  (was: health-check)
Fix Version/s: 1.1.0

> Change health check to use childHooks to enter the namespaces of the container
> --
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom clone 
> to implement this:
> {code}
>   return process::defaultClone([=]() -> int {
>     if (taskPid.isSome()) {
>       foreach (const string& ns, namespaces) {
>         Try<Nothing> setns = ns::setns(taskPid.get(), ns);
>         if (setns.isError()) {
>           ...
>         }
>       }
>     }
>     return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.





[jira] [Updated] (MESOS-6236) Launch subprocesses associated with specified namespaces.

2016-09-23 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6236:
---
Description: 
Currently there is no standard way in Mesos to launch a child process in a 
different namespace (e.g. {{net}}, {{mnt}}). A user may leverage {{Subprocess}} 
and provide its own {{clone}} callback, but this approach is error-prone.

One possible solution is to implement a {{Subprocess}} child hook. In 
[MESOS-5070|https://issues.apache.org/jira/browse/MESOS-5070], we have 
introduced a child hook framework in subprocess and implemented three child 
hooks: {{CHDIR}}, {{SETSID}} and {{SUPERVISOR}}. We suggest introducing another 
child hook, {{SETNS}}, so that other components (e.g., health check) can use it 
to enter the namespaces of a specific process.

  was:In [MESOS-5070|https://issues.apache.org/jira/browse/MESOS-5070], we have 
introduced a child hook framework in subprocess and implemented three child 
hooks {{CHDIR}}, {{SETSID}} and {{SUPERVISOR}}. In this ticket, we'd like to 
introduce another child hook {{SETNS}} so that other components (e.g., health 
check) can call it to enter the namespaces of a specific process.


> Launch subprocesses associated with specified namespaces.
> -
>
> Key: MESOS-6236
> URL: https://issues.apache.org/jira/browse/MESOS-6236
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> Currently there is no standard way in Mesos to launch a child process in a 
> different namespace (e.g. {{net}}, {{mnt}}). A user may leverage 
> {{Subprocess}} and provide its own {{clone}} callback, but this approach is 
> error-prone.
> One possible solution is to implement a {{Subprocess}}' child hook. In 
> [MESOS-5070|https://issues.apache.org/jira/browse/MESOS-5070], we have 
> introduced a child hook framework in subprocess and implemented three child 
> hooks {{CHDIR}}, {{SETSID}} and {{SUPERVISOR}}. We suggest introducing 
> another child hook {{SETNS}} so that other components (e.g., health check) 
> can call it to enter the namespaces of a specific process.





[jira] [Commented] (MESOS-6184) Health checks should use a general mechanism to enter namespaces of the task.

2016-09-23 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516215#comment-15516215
 ] 

Alexander Rukletsov commented on MESOS-6184:


Once we transition to a general solution, there will be no more need to expose 
{{defaultClone}}. See https://reviews.apache.org/r/51636/.

> Health checks should use a general mechanism to enter namespaces of the task.
> -
>
> Key: MESOS-6184
> URL: https://issues.apache.org/jira/browse/MESOS-6184
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To perform health checks for tasks, we need to enter the corresponding 
> namespaces of the container. For now, the health check uses a custom 
> clone to implement this:
> {code}
>   return process::defaultClone([=]() -> int {
>     if (taskPid.isSome()) {
>       foreach (const string& ns, namespaces) {
>         Try<Nothing> setns = ns::setns(taskPid.get(), ns);
>         if (setns.isError()) {
>           ...
>         }
>       }
>     }
>     return func();
>   });
> {code}
> After the childHooks patches are merged, we could change the health check to 
> use childHooks to call {{setns}} and make {{process::defaultClone}} private 
> again.





[jira] [Updated] (MESOS-6236) Launch subprocesses associated with specified namespaces.

2016-09-23 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6236:
---
Summary: Launch subprocesses associated with specified namespaces.  (was: 
Introduce SETNS child hook in subprocess)

> Launch subprocesses associated with specified namespaces.
> -
>
> Key: MESOS-6236
> URL: https://issues.apache.org/jira/browse/MESOS-6236
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> In [MESOS-5070|https://issues.apache.org/jira/browse/MESOS-5070], we have 
> introduced a child hook framework in subprocess and implemented three child 
> hooks {{CHDIR}}, {{SETSID}} and {{SUPERVISOR}}. In this ticket, we'd like to 
> introduce another child hook {{SETNS}} so that other components (e.g., health 
> check) can call it to enter the namespaces of a specific process.





[jira] [Updated] (MESOS-6236) Introduce SETNS child hook in subprocess

2016-09-23 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6236:
---
 Story Points: 8
   Labels: mesosphere  (was: )
Fix Version/s: 1.1.0

> Introduce SETNS child hook in subprocess
> 
>
> Key: MESOS-6236
> URL: https://issues.apache.org/jira/browse/MESOS-6236
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> In [MESOS-5070|https://issues.apache.org/jira/browse/MESOS-5070], we have 
> introduced a child hook framework in subprocess and implemented three child 
> hooks {{CHDIR}}, {{SETSID}} and {{SUPERVISOR}}. In this ticket, we'd like to 
> introduce another child hook {{SETNS}} so that other components (e.g., health 
> check) can call it to enter the namespaces of a specific process.





[jira] [Commented] (MESOS-5070) Introduce more flexible subprocess interface for child options.

2016-09-22 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512856#comment-15512856
 ] 

Alexander Rukletsov commented on MESOS-5070:


This is correct.

> Introduce more flexible subprocess interface for child options.
> ---
>
> Key: MESOS-5070
> URL: https://issues.apache.org/jira/browse/MESOS-5070
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: tech-debt
>
> We introduced a number of parameters to the subprocess interface with 
> MESOS-5049.
> Adding all options explicitly to the subprocess interface makes it 
> inflexible. 
> We should investigate a more flexible option that still prevents arbitrary 
> code from being executed.





[jira] [Commented] (MESOS-5227) Implement HTTP Docker Executor that uses the Executor Library

2016-09-22 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512943#comment-15512943
 ] 

Alexander Rukletsov commented on MESOS-5227:


[~anandmazumdar], are these patches still relevant, or have we chosen a 
different approach?

> Implement HTTP Docker Executor that uses the Executor Library
> -
>
> Key: MESOS-5227
> URL: https://issues.apache.org/jira/browse/MESOS-5227
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Yong Tang
>
> Similar to what we did with the HTTP command executor in MESOS-3558 we should 
> have a HTTP docker executor that can speak the v1 Executor API.





[jira] [Commented] (MESOS-5123) Docker task may fail if path to agent work_dir is relative.

2016-09-22 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512939#comment-15512939
 ] 

Alexander Rukletsov commented on MESOS-5123:


[~klaus1982] Is r/46298 still relevant?

> Docker task may fail if path to agent work_dir is relative. 
> 
>
> Key: MESOS-5123
> URL: https://issues.apache.org/jira/browse/MESOS-5123
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker
>Affects Versions: 1.0.0
>Reporter: Alexander Rukletsov
>Assignee: Klaus Ma
>  Labels: docker, documentation, mesosphere
>
> When a relative path is specified for the agent’s {{\-\-work_dir}} (e.g., 
> {{\-\-work_dir=w/s}}), docker complains that there are forbidden symbols in a 
> *local* volume name. Specifying an absolute path (e.g., 
> {{\-\-work_dir=/tmp}}) solves the problem.
> Docker error observed:
> {noformat}
> docker: Error response from daemon: create 
> w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc:
>  volume name invalid: 
> "w/s/slaves/33b8fe47-e9e0-468a-83a6-98c1e3537e59-S1/frameworks/33b8fe47-e9e0-468a-83a6-98c1e3537e59-0001/executors/docker-test/runs/3cc5cb04-d0a9-490e-94d5-d446b66c97cc"
>  includes invalid characters for a local volume name, only 
> "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed.
> {noformat}
> First off, it is not obvious that Mesos always creates a volume for the 
> sandbox. We may want to document it.
> Second, it's hard to see that a relative {{work_dir}} can trigger a 
> forbidden-symbols error in docker. Does it make sense to check this during 
> agent launch if the docker containerizer is enabled, or to reject docker 
> tasks during task validation?





[jira] [Commented] (MESOS-5874) Only send ShutdownFrameworkMessage to agents associated with framework.

2016-09-22 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513491#comment-15513491
 ] 

Alexander Rukletsov commented on MESOS-5874:


[~jjanco], do you want to close the ticket and the corresponding review request?

> Only send ShutdownFrameworkMessage to agents associated with framework.
> ---
>
> Key: MESOS-5874
> URL: https://issues.apache.org/jira/browse/MESOS-5874
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Jacob Janco
>Assignee: Jacob Janco
>Priority: Minor
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> slave.cpp:2079] Asked to shut down framework ${framework} by master@${master}
> slave.cpp:2094] Cannot shut down unknown framework ${framework} 
> For clusters with high framework churn, this saturates agent logs with these 
> messages. When a framework terminates, a ShutdownFrameworkMessage is sent to 
> every registered slave in a for loop. This patch proposes sending this 
> message only to agents with executors associated with the framework. 
> Also proposed is moving the log line to VLOG(1). 





[jira] [Commented] (MESOS-6182) LinuxRootfs::create ignores failures from adding non-existing files

2016-09-16 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496363#comment-15496363
 ] 

Alexander Rukletsov commented on MESOS-6182:


As a first step, can we make {{LinuxRootfs::create()}} return an error when 
{{os::realpath()}} returns {{None()}}, which indicates the file can't be found?

> LinuxRootfs::create ignores failures from adding non-existing files
> ---
>
> Key: MESOS-6182
> URL: https://issues.apache.org/jira/browse/MESOS-6182
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>
> {{LinuxRootfs::create}} attempts to add a hardcoded list of files to the 
> created rootfs. However, if a file does not exist, no failure is surfaced, 
> and the file will simply be missing from the rootfs.
> This can then lead to failures in tests using the rootfs and relying on files 
> in it.
> We should make failures to compose the planned rootfs explicit so users of 
> this test code know what they can rely on.





[jira] [Commented] (MESOS-5987) Update health check protobuf for HTTP and TCP health check

2016-09-07 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471410#comment-15471410
 ] 

Alexander Rukletsov commented on MESOS-5987:


For posterity: the removed message and the changed field were experimental 
features and were not supposed to be part of the stable API. However, we 
decided to restore them to support those who rely on them.

> Update health check protobuf for HTTP and TCP health check
> --
>
> Key: MESOS-5987
> URL: https://issues.apache.org/jira/browse/MESOS-5987
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To support HTTP and TCP health checks, we need to update the existing 
> {{HealthCheck}} protobuf message according to the comments from [~alexr] 
> and [~gaston] in https://reviews.apache.org/r/36816/ and 
> https://reviews.apache.org/r/49360/





[jira] [Updated] (MESOS-5987) Update health check protobuf for HTTP and TCP health check

2016-09-07 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5987:
---
Sprint: Mesosphere Sprint 40, Mesosphere Sprint 42  (was: Mesosphere Sprint 
40)

> Update health check protobuf for HTTP and TCP health check
> --
>
> Key: MESOS-5987
> URL: https://issues.apache.org/jira/browse/MESOS-5987
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To support HTTP and TCP health checks, we need to update the existing 
> {{HealthCheck}} protobuf message according to the comments from [~alexr] 
> and [~gaston] in https://reviews.apache.org/r/36816/ and 
> https://reviews.apache.org/r/49360/





[jira] [Updated] (MESOS-5961) HTTP and TCP health checks should support docker executor and bridged mode.

2016-08-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5961:
---
Story Points: 8  (was: 3)

> HTTP and TCP health checks should support docker executor and bridged mode.
> ---
>
> Key: MESOS-5961
> URL: https://issues.apache.org/jira/browse/MESOS-5961
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>  Labels: health-check, mesosphere
>
> If an executor and a task, e.g. the docker executor and docker container in 
> bridged mode, exist in different network namespaces, HTTP and TCP health 
> checks using {{localhost}} may not work properly. One solution would be to 
> enter the container's network namespace in the health check binary.





[jira] [Issue Comment Deleted] (MESOS-5954) Docker executor does not use HealthChecker library.

2016-08-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5954:
---
Comment: was deleted

(was: In order to support MESOS-5961, we decided to re-use the binary to enter 
the container's namespace.)

> Docker executor does not use HealthChecker library.
> ---
>
> Key: MESOS-5954
> URL: https://issues.apache.org/jira/browse/MESOS-5954
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> https://github.com/apache/mesos/commit/1556d9a3a02de4e8a90b5b64d268754f95b12d77
>  refactored health checks into a library. Command executor uses the library 
> instead of the "mesos-health-check" binary, docker executor should do the 
> same for consistency.





[jira] [Updated] (MESOS-5954) Docker executor does not use HealthChecker library.

2016-08-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5954:
---
Sprint: Mesosphere Sprint 41

> Docker executor does not use HealthChecker library.
> ---
>
> Key: MESOS-5954
> URL: https://issues.apache.org/jira/browse/MESOS-5954
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> https://github.com/apache/mesos/commit/1556d9a3a02de4e8a90b5b64d268754f95b12d77
>  refactored health checks into a library. Command executor uses the library 
> instead of the "mesos-health-check" binary, docker executor should do the 
> same for consistency.





[jira] [Issue Comment Deleted] (MESOS-5955) The "mesos-health-check" binary is not used anymore.

2016-08-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5955:
---
Comment: was deleted

(was: In order to support MESOS-5961, we decided to re-use the binary to enter 
the container's namespace.)

> The "mesos-health-check" binary is not used anymore.
> 
>
> Key: MESOS-5955
> URL: https://issues.apache.org/jira/browse/MESOS-5955
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> MESOS-5727 and MESOS-5954 refactored the health check code into the 
> {{HealthChecker}} library, hence the "mesos-health-check" binary became 
> unused.
> While the command and docker executors could just use the library to avoid 
> the subprocess complexity, we may want to consider keeping a binary version 
> that ships with the installation, because the intention of the binary was to 
> allow other executors to re-use our implementation. On the other hand, the 
> binary is ill suited to this because it uses libprocess message passing, so 
> if no code requires it, it seems fine to remove it for now. Custom executors 
> may use the {{HealthChecker}} library directly; it is not much more complex 
> than using the binary.





[jira] [Updated] (MESOS-5955) The "mesos-health-check" binary is not used anymore.

2016-08-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5955:
---
Sprint: Mesosphere Sprint 41

> The "mesos-health-check" binary is not used anymore.
> 
>
> Key: MESOS-5955
> URL: https://issues.apache.org/jira/browse/MESOS-5955
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Alexander Rukletsov
>Assignee: haosdent
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> MESOS-5727 and MESOS-5954 refactored the health check code into the 
> {{HealthChecker}} library, hence the "mesos-health-check" binary became 
> unused.
> While the command and docker executors could just use the library to avoid 
> the subprocess complexity, we may want to consider keeping a binary version 
> that ships with the installation, because the intention of the binary was to 
> allow other executors to re-use our implementation. On the other hand, the 
> binary is ill suited to this because it uses libprocess message passing, so 
> if no code requires it, it seems fine to remove it for now. Custom executors 
> may use the {{HealthChecker}} library directly; it is not much more complex 
> than using the binary.





[jira] [Updated] (MESOS-2533) Support HTTP checks in Mesos.

2016-08-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-2533:
---
Summary: Support HTTP checks in Mesos.  (was: Support HTTP checks in Mesos 
health check program)

> Support HTTP checks in Mesos.
> -
>
> Key: MESOS-2533
> URL: https://issues.apache.org/jira/browse/MESOS-2533
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>Assignee: haosdent
>  Labels: health-check, mesosphere
>
> Currently, only commands are supported but our health check protobuf enables 
> users to encode HTTP checks as well. We should wire this up in the health 
> check program or remove the http field from the protobuf.





[jira] [Updated] (MESOS-3567) Support TCP checks in Mesos.

2016-08-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3567:
---
Summary: Support TCP checks in Mesos.  (was: Support TCP checks in Mesos 
health check program)

> Support TCP checks in Mesos.
> 
>
> Key: MESOS-3567
> URL: https://issues.apache.org/jira/browse/MESOS-3567
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Matthias Veit
>Assignee: haosdent
>  Labels: health-check, mesosphere
>
> In Marathon we have the ability to specify Health Checks for:
> - Command (Mesos supports this)
> - HTTP (see progress in MESOS-2533)
> - TCP (missing)
> See here for reference: 
> https://mesosphere.github.io/marathon/docs/health-checks.html
> Since we have had good experiences with those 3 options in Marathon, I see a 
> lot of value in Mesos also supporting them.





[jira] [Commented] (MESOS-5070) Introduce more flexible subprocess interface for child options.

2016-08-24 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435077#comment-15435077
 ] 

Alexander Rukletsov commented on MESOS-5070:


[~jieyu], [~js84], what are the use cases for this change? We needed something 
similar with [~haosd...@gmail.com] recently and came up with the custom clone 
solution. It looks something like this:
{code}
// Call `setns` between `clone` and `exec`.
pid_t myClone(
    const lambda::function<int()>& func,
    pid_t taskPid,
    const vector<string>& namespaces)
{
  pid_t pid = os::clone([=]() -> int {
    foreach (const string& ns, namespaces) {
      Try<Nothing> setns = ns::setns(taskPid, ns);
      if (setns.isError()) {
        return EXIT_FAILURE;
      }
    }

    return func();
  }, SIGCHLD);

  return pid;
}
{code}

> Introduce more flexible subprocess interface for child options.
> ---
>
> Key: MESOS-5070
> URL: https://issues.apache.org/jira/browse/MESOS-5070
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: tech-debt
>
> We introduced a number of parameters to the subprocess interface with 
> MESOS-5049.
> Adding all options explicitly to the subprocess interface makes it 
> inflexible. 
> We should investigate a more flexible option that still prevents arbitrary 
> code from being executed.





[jira] [Commented] (MESOS-5874) Only send ShutdownFrameworkMessage to agents associated with framework.

2016-09-29 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15532282#comment-15532282
 ] 

Alexander Rukletsov commented on MESOS-5874:


{noformat}
Commit: 88b2da4b2fcebd9724b17292af27a4a4c99e584f [88b2da4]
Author: Jacob Janco 
Date: 29 September 2016 at 11:09:05 GMT+2
Committer: Alexander Rukletsov 
Commit Date: 29 September 2016 at 11:15:23 GMT+2

Tuned agent logging for ShutdownFrameworkMessage.

ShutdownFrameworkMessage is broadcast to all agents in the cluster.
For high framework clusters this message leads to agent log saturation.
On completion of MESOS-1961, implementing executor state reconciliation,
this message can be targeted at agents per MESOS-5784.

Review: https://reviews.apache.org/r/52371/
{noformat}

> Only send ShutdownFrameworkMessage to agents associated with framework.
> ---
>
> Key: MESOS-5874
> URL: https://issues.apache.org/jira/browse/MESOS-5874
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Jacob Janco
>Assignee: Jacob Janco
>Priority: Minor
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> slave.cpp:2079] Asked to shut down framework ${framework} by master@${master}
> slave.cpp:2094] Cannot shut down unknown framework ${framework} 
> For clusters with high framework churn, this saturates agent logs with these 
> messages. When a framework terminates, a ShutdownFrameworkMessage is sent to 
> every registered slave in a for loop. This patch proposes sending this 
> message only to agents with executors associated with the framework. 
> Also proposed is moving the log line to VLOG(1). 





[jira] [Commented] (MESOS-6236) Launch subprocesses associated with specified namespaces.

2016-09-26 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523030#comment-15523030
 ] 

Alexander Rukletsov commented on MESOS-6236:


Stout looks like a better place for them anyway, right?

> Launch subprocesses associated with specified namespaces.
> -
>
> Key: MESOS-6236
> URL: https://issues.apache.org/jira/browse/MESOS-6236
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>  Labels: mesosphere
> Fix For: 1.1.0
>
>
> Currently there is no standard way in Mesos to launch a child process in a 
> different namespace (e.g. {{net}}, {{mnt}}). A user may leverage 
> {{Subprocess}} and provide its own {{clone}} callback, but this approach is 
> error-prone.
> One possible solution is to implement a {{Subprocess}}' child hook. In 
> [MESOS-5070|https://issues.apache.org/jira/browse/MESOS-5070], we have 
> introduced a child hook framework in subprocess and implemented three child 
> hooks {{CHDIR}}, {{SETSID}} and {{SUPERVISOR}}. We suggest introducing 
> another child hook {{SETNS}} so that other components (e.g., health check) 
> can call it to enter the namespaces of a specific process.





[jira] [Updated] (MESOS-6254) Child hooks are not async signal safe.

2016-09-26 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6254:
---
Description: 
[Some child 
hooks|https://github.com/apache/mesos/blob/ec4c81a12559030791334359e7e1e2b6565cce01/3rdparty/libprocess/src/subprocess.cpp#L67]
 create an {{Error}} instance, which is strictly speaking not async signal safe.

We should either refactor {{Error}} to be no-throw or avoid using it in child 
hooks.

  was:
[Some child 
hooks|https://github.com/apache/mesos/blob/ec4c81a12559030791334359e7e1e2b6565cce01/3rdparty/libprocess/src/subprocess.cpp#L67]
 create an {{Error}} instance, which is strictly speaking no async signal safe.

We should either refactor {{Error}} to be no-throw or avoid using it in child 
hooks.


> Child hooks are not async signal safe.
> --
>
> Key: MESOS-6254
> URL: https://issues.apache.org/jira/browse/MESOS-6254
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Alexander Rukletsov
>  Labels: mesosphere
>
> [Some child 
> hooks|https://github.com/apache/mesos/blob/ec4c81a12559030791334359e7e1e2b6565cce01/3rdparty/libprocess/src/subprocess.cpp#L67]
>  create an {{Error}} instance, which is strictly speaking not async signal 
> safe.
> We should either refactor {{Error}} to be no-throw or avoid using it in child 
> hooks.





[jira] [Created] (MESOS-6254) Child hooks are not async signal safe.

2016-09-26 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6254:
--

 Summary: Child hooks are not async signal safe.
 Key: MESOS-6254
 URL: https://issues.apache.org/jira/browse/MESOS-6254
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.0.1, 1.0.0
Reporter: Alexander Rukletsov


[Some child 
hooks|https://github.com/apache/mesos/blob/ec4c81a12559030791334359e7e1e2b6565cce01/3rdparty/libprocess/src/subprocess.cpp#L67]
 create an {{Error}} instance, which is strictly speaking no async signal safe.

We should either refactor {{Error}} to be no-throw or avoid using it in child 
hooks.





[jira] [Created] (MESOS-6288) The default executor should maintain launcher_dir.

2016-09-29 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6288:
--

 Summary: The default executor should maintain launcher_dir.
 Key: MESOS-6288
 URL: https://issues.apache.org/jira/browse/MESOS-6288
 Project: Mesos
  Issue Type: Bug
Reporter: Alexander Rukletsov
Assignee: Gastón Kleiman
 Fix For: 1.1.0


Both the command and docker executors require that {{launcher_dir}} be 
provided via a flag. This directory contains Mesos binaries, e.g., the TCP 
checker necessary for TCP health checks. The default executor should somehow 
obtain (via a flag or an env var) and maintain this directory for the health 
checker to use.





[jira] [Updated] (MESOS-6278) Add test cases for the HTTP health checks.

2016-10-27 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6278:
---
Summary: Add test cases for the HTTP health checks.  (was: Add test cases 
for the HTTP health checks)

> Add test cases for the HTTP health checks.
> --
>
> Key: MESOS-6278
> URL: https://issues.apache.org/jira/browse/MESOS-6278
> Project: Mesos
>  Issue Type: Task
>  Components: tests
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere, test
>






[jira] [Updated] (MESOS-4875) overlayfs does not work when launching tasks.

2016-11-09 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4875:
---
Summary: overlayfs does not work when launching tasks.  (was: overlayfs 
does not work when lauching tasks)

> overlayfs does not work when launching tasks.
> -
>
> Key: MESOS-4875
> URL: https://issues.apache.org/jira/browse/MESOS-4875
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Fix For: 1.0.0
>
>
> Enabled the overlay backend and launched a task; the task failed to start. 
> The executor log shows the following:
> {code}
> Failed to create sandbox mount point  at 
> '/tmp/mesos/slaves/bbc41bda-747a-420e-88d2-cf100fa8b6d5-S1/frameworks/bbc41bda-747a-420e-88d2-cf100fa8b6d5-0001/executors/test_mesos/runs/3736fb2a-de7a-4aba-9b08-25c73be7879f/.rootfs/mnt/mesos/sandbox':
>  Read-only file system
> {code}





[jira] [Updated] (MESOS-4875) overlayfs does not work when launching tasks.

2016-11-09 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4875:
---
Shepherd: Jie Yu

> overlayfs does not work when launching tasks.
> -
>
> Key: MESOS-4875
> URL: https://issues.apache.org/jira/browse/MESOS-4875
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Fix For: 1.0.0
>
>
> Enabled the overlay backend and launched a task; the task failed to start. 
> The executor log shows the following:
> {code}
> Failed to create sandbox mount point  at 
> '/tmp/mesos/slaves/bbc41bda-747a-420e-88d2-cf100fa8b6d5-S1/frameworks/bbc41bda-747a-420e-88d2-cf100fa8b6d5-0001/executors/test_mesos/runs/3736fb2a-de7a-4aba-9b08-25c73be7879f/.rootfs/mnt/mesos/sandbox':
>  Read-only file system
> {code}





[jira] [Commented] (MESOS-4875) overlayfs does not work when launching tasks.

2016-11-09 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650638#comment-15650638
 ] 

Alexander Rukletsov commented on MESOS-4875:


Please make sure the "shepherd" field is properly set.

> overlayfs does not work when launching tasks.
> -
>
> Key: MESOS-4875
> URL: https://issues.apache.org/jira/browse/MESOS-4875
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Fix For: 1.0.0
>
>
> Enable the overlay backend and launch a task; the task fails to start. 
> Checking the executor log shows the following:
> {code}
> Failed to create sandbox mount point  at 
> '/tmp/mesos/slaves/bbc41bda-747a-420e-88d2-cf100fa8b6d5-S1/frameworks/bbc41bda-747a-420e-88d2-cf100fa8b6d5-0001/executors/test_mesos/runs/3736fb2a-de7a-4aba-9b08-25c73be7879f/.rootfs/mnt/mesos/sandbox':
>  Read-only file system
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4404) SlaveTest.HTTPSchedulerSlaveRestart is flaky

2016-11-10 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15654101#comment-15654101
 ] 

Alexander Rukletsov commented on MESOS-4404:


Saw one more failure in Apache CI:
{noformat}
[ RUN  ] SlaveTest.HTTPSchedulerSlaveRestart
I1110 01:47:03.995028 30949 cluster.cpp:158] Creating default 'local' authorizer
I1110 01:47:03.997442 30949 leveldb.cpp:174] Opened db in 2.129383ms
I1110 01:47:03.997962 30949 leveldb.cpp:181] Compacted db in 489488ns
I1110 01:47:03.998003 30949 leveldb.cpp:196] Created db iterator in 17567ns
I1110 01:47:03.998016 30949 leveldb.cpp:202] Seeked to beginning of db in 1688ns
I1110 01:47:03.998024 30949 leveldb.cpp:271] Iterated through 0 keys in the db 
in 347ns
I1110 01:47:03.998062 30949 replica.cpp:776] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I1110 01:47:03.998610 30982 recover.cpp:451] Starting replica recovery
I1110 01:47:03.998934 30982 recover.cpp:477] Replica is in EMPTY status
I1110 01:47:04.000113 30980 replica.cpp:673] Replica in EMPTY status received a 
broadcasted recover request from __req_res__(6273)@172.17.0.2:55933
I1110 01:47:04.000543 30968 recover.cpp:197] Received a recover response from a 
replica in EMPTY status
I1110 01:47:04.000973 30981 recover.cpp:568] Updating replica status to STARTING
I1110 01:47:04.001570 30968 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 385848ns
I1110 01:47:04.001596 30968 replica.cpp:320] Persisted replica status to 
STARTING
I1110 01:47:04.001797 30972 recover.cpp:477] Replica is in STARTING status
I1110 01:47:04.002784 30983 master.cpp:380] Master 
0c3821c4-3478-4e0b-8621-6ed22c08bdda (2aa403d92175) started on 172.17.0.2:55933
I1110 01:47:04.002820 30983 master.cpp:382] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/NkUSvj/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
--registry_max_agent_count="102400" --registry_store_timeout="100secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/mesos/mesos-1.2.0/_inst/share/mesos/webui" 
--work_dir="/tmp/NkUSvj/master" --zk_session_timeout="10secs"
I1110 01:47:04.003515 30983 master.cpp:432] Master only allowing authenticated 
frameworks to register
I1110 01:47:04.003500 30981 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from __req_res__(6274)@172.17.0.2:55933
I1110 01:47:04.003552 30983 master.cpp:446] Master only allowing authenticated 
agents to register
I1110 01:47:04.003653 30983 master.cpp:459] Master only allowing authenticated 
HTTP frameworks to register
I1110 01:47:04.003700 30983 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/NkUSvj/credentials'
I1110 01:47:04.003860 30981 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I1110 01:47:04.004154 30983 master.cpp:504] Using default 'crammd5' 
authenticator
I1110 01:47:04.004365 30983 http.cpp:895] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1110 01:47:04.004638 30983 http.cpp:895] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1110 01:47:04.004794 30983 http.cpp:895] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1110 01:47:04.004885 30975 recover.cpp:568] Updating replica status to VOTING
I1110 01:47:04.005070 30983 master.cpp:584] Authorization enabled
I1110 01:47:04.005408 30980 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 286810ns
I1110 01:47:04.005434 30980 replica.cpp:320] Persisted replica status to VOTING
I1110 01:47:04.005494 30970 whitelist_watcher.cpp:77] No whitelist given
I1110 01:47:04.005511 30969 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I1110 01:47:04.005564 30980 recover.cpp:582] Successfully joined the Paxos group
I1110 01:47:04.005795 30980 recover.cpp:466] Recover process terminated
I1110 01:47:04.008816 30978 master.cpp:2033] Elected as the leading master!
I1110 01:47:04.008860 30978 master.cpp:1560] Recovering from 

[jira] [Created] (MESOS-6417) Introduce an extra 'unknown' health check state.

2016-10-19 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6417:
--

 Summary: Introduce an extra 'unknown' health check state.
 Key: MESOS-6417
 URL: https://issues.apache.org/jira/browse/MESOS-6417
 Project: Mesos
  Issue Type: Improvement
Reporter: Alexander Rukletsov


There are three logical states regarding health checks:
1) no health checks;
2) a health check is defined, but no result is available yet;
3) a health check is defined, it is either healthy or not.

Currently, we do not distinguish between 1) and 2), which can be problematic 
for framework authors.
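The distinction could be modeled with an explicit state for "defined but not yet checked". A minimal sketch, with illustrative names that are not the actual Mesos protobuf definitions:

```python
from enum import Enum


class HealthState(Enum):
    """Illustrative health-check states; not the actual Mesos API."""
    NO_HEALTH_CHECK = 0  # 1) no health check is defined
    UNKNOWN = 1          # 2) a check is defined, but no result yet
    HEALTHY = 2          # 3) a check is defined and the last run passed
    UNHEALTHY = 3        # 3) a check is defined and the last run failed


def initial_state(has_health_check):
    """Before the first check result arrives, cases 1) and 2) must differ."""
    return HealthState.UNKNOWN if has_health_check else HealthState.NO_HEALTH_CHECK
```

With this extra state, a framework can tell "no check configured" apart from "check configured, result pending".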



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4292) Tests for quota with implicit roles.

2016-10-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-4292:
---
Shepherd: Alexander Rukletsov
Assignee: Zhitao Li

> Tests for quota with implicit roles.
> 
>
> Key: MESOS-4292
> URL: https://issues.apache.org/jira/browse/MESOS-4292
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Alexander Rukletsov
>Assignee: Zhitao Li
>  Labels: mesosphere
>
> With the introduction of implicit roles (MESOS-3988), we should make sure 
> quota can be set for an inactive role (unknown to the master) and maybe 
> transition it to the active state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3959) Executor page of mesos ui does not show slave hostname.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-3959:
---
Summary: Executor page of mesos ui does not show slave hostname.  (was: 
Executor page of mesos ui does not show slave hostname)

> Executor page of mesos ui does not show slave hostname.
> ---
>
> Key: MESOS-3959
> URL: https://issues.apache.org/jira/browse/MESOS-3959
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: Ian Babrou
>
> This is not really convenient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6028) mesos-execute has a typo in volume help.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6028:
---
Summary: mesos-execute has a typo in volume help.  (was: typo in 
mesos-execute usage)

> mesos-execute has a typo in volume help.
> 
>
> Key: MESOS-6028
> URL: https://issues.apache.org/jira/browse/MESOS-6028
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Stéphane Cottin
>Assignee: Tomasz Janiszewski
>Priority: Minor
>
> s/docker_options/driver_options/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6344) Allow `network/cni` isolator to take a search path for CNI plugins instead of single directory

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6344:
---
Priority: Blocker  (was: Major)

> Allow `network/cni` isolator to take a search path for CNI plugins instead of 
> single directory
> --
>
> Key: MESOS-6344
> URL: https://issues.apache.org/jira/browse/MESOS-6344
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>Priority: Blocker
>  Labels: mesosphere
>
> Currently the `network/cni` isolator expects a single directory in the 
> `--network_cni_plugins_dir` flag. This is very limiting because it forces the 
> operator to put all the CNI plugins in the same directory. 
> With the Mesos port-mapper CNI plugin this would also imply that the operator 
> would have to move this plugin from the Mesos installation directory to the 
> directory specified in `--network_cni_plugins_dir`. 
> To simplify the operator's experience, it would make sense for the 
> `--network_cni_plugins_dir` flag to accept a set of directories instead of a 
> single directory. The `network/cni` isolator can then search this set of 
> directories to find the CNI plugin.
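The proposed lookup amounts to `$PATH`-style resolution. A sketch under the assumption of a colon-separated directory list (the helper and flag semantics are illustrative, not the final design):

```python
import os


def find_cni_plugin(search_path, name):
    """Return the first executable named `name` found in the colon-separated
    `search_path`, or None if no directory contains it.

    (Hypothetical helper mirroring $PATH resolution; the real flag parsing
    in Mesos may differ.)
    """
    for directory in search_path.split(":"):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

This lets the Mesos-provided port-mapper stay in the installation directory while operator plugins live elsewhere on the search path.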



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6035) Add non-recursive version of cgroups::get

2016-10-14 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575167#comment-15575167
 ] 

Alexander Rukletsov edited comment on MESOS-6035 at 10/14/16 12:16 PM:
---

{noformat}
Commit: fcd5106b5dfa14bc83eae68415bd4782c16f79a4 [fcd5106]
Author: Alexander Rukletsov 
Date: 14 October 2016 at 14:10:23 GMT+2
Commit Date: 14 October 2016 at 14:15:02 GMT+2

Revert "Removed the expired TODO about non-recursive version...
`cgroups::get`."

This reverts commit e042aa071a77ef1922d9b1a93f6e8adf221979b3.

RR https://reviews.apache.org/r/51185/ should have been committed
together with https://reviews.apache.org/r/51031/. However, the
latter is not going to make it into the 1.1.0 release, hence the
former is reverted now to avoid confusion.
{noformat}


was (Author: alexr):
{noformat}
Commit: fcd5106b5dfa14bc83eae68415bd4782c16f79a4 [fcd5106]
Parents: 9fc2901d23
Author: Alexander Rukletsov 
Date: 14 October 2016 at 14:10:23 GMT+2
Commit Date: 14 October 2016 at 14:15:02 GMT+2
Labels: HEAD -> master

Revert "Removed the expired TODO about non-recursive version...
`cgroups::get`."

This reverts commit e042aa071a77ef1922d9b1a93f6e8adf221979b3.

RR https://reviews.apache.org/r/51185/ should have been committed
together with https://reviews.apache.org/r/51031/. However, the
latter is not going to make it into the 1.1.0 release, hence the
former is reverted now to avoid confusion.
{noformat}

> Add non-recursive version of cgroups::get
> -
>
> Key: MESOS-6035
> URL: https://issues.apache.org/jira/browse/MESOS-6035
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>
> In some cases, we only need the top-level cgroups instead of retrieving 
> all cgroups recursively. Adding a non-recursive version would help avoid 
> traversing unnecessary paths.
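The difference between the two traversals can be sketched as follows; this only illustrates the idea, since the actual `cgroups::get` is C++ with a different signature:

```python
import os


def list_cgroups(hierarchy, recursive=True):
    """List cgroup directories under `hierarchy`, either recursively (walk
    the whole tree) or only at the top level (single readdir) — a sketch of
    the proposed non-recursive variant of cgroups::get."""
    if not recursive:
        # Non-recursive: one directory listing, no tree walk.
        return sorted(d for d in os.listdir(hierarchy)
                      if os.path.isdir(os.path.join(hierarchy, d)))
    result = []
    for root, dirs, _ in os.walk(hierarchy):
        for d in dirs:
            result.append(os.path.relpath(os.path.join(root, d), hierarchy))
    return sorted(result)
```

On hierarchies with deeply nested per-container cgroups, the non-recursive form avoids touching every descendant directory.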



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6035) Add non-recursive version of cgroups::get

2016-10-14 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575167#comment-15575167
 ] 

Alexander Rukletsov commented on MESOS-6035:


{noformat}
Commit: fcd5106b5dfa14bc83eae68415bd4782c16f79a4 [fcd5106]
Parents: 9fc2901d23
Author: Alexander Rukletsov 
Date: 14 October 2016 at 14:10:23 GMT+2
Commit Date: 14 October 2016 at 14:15:02 GMT+2
Labels: HEAD -> master

Revert "Removed the expired TODO about non-recursive version...
`cgroups::get`."

This reverts commit e042aa071a77ef1922d9b1a93f6e8adf221979b3.

RR https://reviews.apache.org/r/51185/ should have been committed
together with https://reviews.apache.org/r/51031/. However, the
latter is not going to make it into the 1.1.0 release, hence the
former is reverted now to avoid confusion.
{noformat}

> Add non-recursive version of cgroups::get
> -
>
> Key: MESOS-6035
> URL: https://issues.apache.org/jira/browse/MESOS-6035
> Project: Mesos
>  Issue Type: Improvement
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>
> In some cases, we only need the top-level cgroups instead of retrieving 
> all cgroups recursively. Adding a non-recursive version would help avoid 
> traversing unnecessary paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5963) HealthChecker should not decide when to kill tasks and when to stop performing health checks.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5963:
---
Shepherd: Benjamin Mahler
Assignee: Alexander Rukletsov  (was: haosdent)
  Sprint: Mesosphere Sprint 45
Target Version/s: 1.2.0

> HealthChecker should not decide when to kill tasks and when to stop 
> performing health checks.
> -
>
> Key: MESOS-5963
> URL: https://issues.apache.org/jira/browse/MESOS-5963
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: health-check, mesosphere
>
> Currently, the {{HealthChecker}} library decides when a task should be killed 
> based on its health status. Moreover, it stops checking the task's health after 
> that. This seems unfortunate, because it should be up to the executor and/or 
> framework to decide both when to kill tasks and when to health check them. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6395) HealthChecker sends updates to executor via libprocess messaging.

2016-10-14 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6395:
--

 Summary: HealthChecker sends updates to executor via libprocess 
messaging.
 Key: MESOS-6395
 URL: https://issues.apache.org/jira/browse/MESOS-6395
 Project: Mesos
  Issue Type: Improvement
Reporter: Alexander Rukletsov
Assignee: Alexander Rukletsov


Currently {{HealthChecker}} sends status updates via libprocess messaging to 
the executor's UPID. This seems unnecessary after refactoring the health checker 
into a library: a simple callback will do. Moreover, not requiring the executor's 
{{UPID}} will simplify creating a mocked {{HealthChecker}}.
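The callback-based shape described above can be sketched like this; the class and method names are illustrative, not the actual Mesos C++ interface:

```python
class HealthChecker:
    """Sketch of the proposed callback-based interface: instead of sending
    status updates over libprocess messaging to the executor's UPID, the
    library invokes a callback supplied at construction. (Names are
    illustrative, not the actual Mesos API.)"""

    def __init__(self, on_health_update):
        self._on_health_update = on_health_update

    def handle_check_result(self, healthy):
        # The library merely reports the result; the executor decides
        # what to do with it (e.g. when to kill the task).
        self._on_health_update(healthy)
```

A mocked checker then needs nothing more than a recording callback, with no UPID or messaging setup.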



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6134) Port CFS quota support to Docker Containerizer using command executor.

2016-10-14 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575096#comment-15575096
 ] 

Alexander Rukletsov commented on MESOS-6134:


This issue is delaying the 1.1.0 release and has shown no progress in the last 
few days. It has been retargeted to 1.2.0.

> Port CFS quota support to Docker Containerizer using command executor.
> --
>
> Key: MESOS-6134
> URL: https://issues.apache.org/jira/browse/MESOS-6134
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> MESOS-2154 only partially fixed CFS quota support in the Docker 
> Containerizer: that fix works only for custom executors.
> This tracks the fix for the command executor so we can declare this complete.
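For reference, the CFS translation that the command-executor path must also apply is quota = cpus × period, with Mesos configuring a 100ms CFS period. A sketch (the real code carries additional validation and minimums):

```python
CFS_PERIOD_US = 100000  # 100ms, the CFS period Mesos configures


def cfs_quota_us(cpus):
    """Translate a fractional CPU allocation into the value written to
    the cgroup's cpu.cfs_quota_us file.

    Sketch only: the actual Mesos containerizer code performs further
    validation before writing the cgroup files.
    """
    return int(cpus * CFS_PERIOD_US)
```

For example, a task with 0.5 cpus gets a 50ms quota per 100ms period.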



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6134) Port CFS quota support to Docker Containerizer using command executor.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6134:
---
Target Version/s: 1.2.0  (was: 1.1.0)

> Port CFS quota support to Docker Containerizer using command executor.
> --
>
> Key: MESOS-6134
> URL: https://issues.apache.org/jira/browse/MESOS-6134
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> MESOS-2154 only partially fixed CFS quota support in the Docker 
> Containerizer: that fix works only for custom executors.
> This tracks the fix for the command executor so we can declare this complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6391) Command task's sandbox should not be owned by root if it uses container image.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6391:
---
Priority: Blocker  (was: Major)

> Command task's sandbox should not be owned by root if it uses container image.
> --
>
> Key: MESOS-6391
> URL: https://issues.apache.org/jira/browse/MESOS-6391
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.28.2, 1.0.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>Priority: Blocker
>
> Currently, if the task defines a container image, the command executor will 
> be run under root because it needs to perform pivot_root.
> That means if the task wants to run under an unprivileged user, the sandbox 
> of that task will not be writable because it's owned by root.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6394) Improvements to partition-aware Mesos frameworks.

2016-10-14 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-6394:
--

 Summary: Improvements to partition-aware Mesos frameworks.
 Key: MESOS-6394
 URL: https://issues.apache.org/jira/browse/MESOS-6394
 Project: Mesos
  Issue Type: Epic
  Components: master
Reporter: Alexander Rukletsov
Assignee: Neil Conway


This is a follow up epic to MESOS-5344 to capture further improvements and 
changes that need to be made to the MVP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6014) Create a CNI plugin that provides port mapping functionality for various CNI plugins.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6014:
---
Priority: Blocker  (was: Major)

> Create a CNI plugin that provides port mapping functionality for various CNI 
> plugins.
> -
>
> Key: MESOS-6014
> URL: https://issues.apache.org/jira/browse/MESOS-6014
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>Priority: Blocker
>  Labels: mesosphere
>
> Currently there is no CNI plugin that supports port mapping. Given that the 
> unified containerizer is starting to become the de-facto container runtime, 
> having a CNI plugin that provides port mapping is a must-have. This is 
> primarily required to support BRIDGE networking mode, similar to the Docker 
> bridge networking that users expect when using Docker containers. 
> While the most obvious use case is using the port-mapper plugin with 
> the bridge plugin, the port-mapping functionality itself is generic and 
> should be usable with any CNI plugin that needs it.
> Keeping port mapping as a CNI plugin gives operators the ability to use the 
> default port-mapper (CNI plugin) that Mesos provides, or to use their own plugin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6335) Add user doc for task group tasks

2016-10-14 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575094#comment-15575094
 ] 

Alexander Rukletsov commented on MESOS-6335:


Is it still to land in 1.1.0?

> Add user doc for task group tasks
> -
>
> Key: MESOS-6335
> URL: https://issues.apache.org/jira/browse/MESOS-6335
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2449) Support group of tasks (Pod) constructs and API in Mesos.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-2449:
---
Priority: Blocker  (was: Major)

> Support group of tasks (Pod) constructs and API in Mesos.
> -
>
> Key: MESOS-2449
> URL: https://issues.apache.org/jira/browse/MESOS-2449
> Project: Mesos
>  Issue Type: Epic
>Reporter: Timothy Chen
>Priority: Blocker
>  Labels: mesosphere
>
> There is a common need among different frameworks that want to start a 
> group of tasks that either depend on or are co-located with each other.
> Although a framework can schedule individual tasks within the same offer and 
> slave id, it doesn't have a way to describe dependencies, failure policies 
> (if one of the tasks failed), network setup, group container information, 
> etc.
> This epic is meant to start the discussion around the requirements folks 
> need, and to see where we can lead this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5344) Partition-aware Mesos frameworks.

2016-10-14 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5344:
---
Summary: Partition-aware Mesos frameworks.  (was: Partition-aware Mesos 
frameworks)

> Partition-aware Mesos frameworks.
> -
>
> Key: MESOS-5344
> URL: https://issues.apache.org/jira/browse/MESOS-5344
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>Priority: Blocker
>  Labels: mesosphere
>
> This epic covers three related tasks:
> 1. Allowing partitioned agents to reregister with the master. This allows 
> frameworks to control how tasks running on partitioned agents should be dealt 
> with.
> 2. Replacing the TASK_LOST task state with a set of more granular states with 
> more precise semantics: UNREACHABLE, DROPPED, UNKNOWN, GONE, and 
> GONE_BY_OPERATOR.
> 3. Allow frameworks to be informed when a task that was running on a 
> partitioned agent has been terminated (GONE and GONE_BY_OPERATOR states).
> These new behaviors will be guarded by the {{PARTITION_AWARE}} framework 
> capability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6376) Add documentation for capabilities support of the mesos containerizer

2016-10-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6376:
---
Target Version/s: 1.2.0  (was: 1.1.0)
Priority: Major  (was: Blocker)

> Add documentation for capabilities support of the mesos containerizer
> -
>
> Key: MESOS-6376
> URL: https://issues.apache.org/jira/browse/MESOS-6376
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6420:
---
Target Version/s: 1.1.0  (was: 1.1.1, 1.2.0)

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
> Fix For: 1.0.2, 1.2.0
>
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 <unfinished ...>
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY <unfinished ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 <unfinished ...>
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY <unfinished ...>
> [pid  6138] 19:18:03.462190 close(3 <unfinished ...>
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 <unfinished ...>
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY <unfinished ...>
> [pid  6138] 19:18:03.462678 close(3 <unfinished ...>
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY <unfinished ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 <unfinished ...>
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.470301 close(27)   = 0
> [pid 57691] 

[jira] [Updated] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6446:
---
Target Version/s: 1.0.2, 1.1.0  (was: 1.0.2)

> WebUI redirect doesn't work with stats from /metric/snapshot
> 
>
> Key: MESOS-6446
> URL: https://issues.apache.org/jira/browse/MESOS-6446
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 1.0.0
>Reporter: Yan Xu
>Assignee: haosdent
>Priority: Blocker
> Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png
>
>
> After Mesos 1.0, the webUI redirect is hidden from the users, so you can go to 
> any of the masters and the webUI is populated with state.json from the leading 
> master. 
> This doesn't include stats from /metric/snapshot though, as that endpoint is not 
> redirected. The user ends up seeing some fields with empty values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6455) DefaultExecutorTests fail when running on hosts without docker

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6455:
---
Target Version/s: 1.1.0

> DefaultExecutorTests fail when running on hosts without docker 
> ---
>
> Key: MESOS-6455
> URL: https://issues.apache.org/jira/browse/MESOS-6455
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.1.0
>Reporter: Yan Xu
>
> {noformat:title=}
> [  FAILED  ] Containterizers/DefaultExecutorTest.ROOT_TaskRunning/1, where 
> GetParam() = "docker,mesos"
> [  FAILED  ] Containterizers/DefaultExecutorTest.ROOT_KillTask/1, where 
> GetParam() = "docker,mesos"
> [  FAILED  ] Containterizers/DefaultExecutorTest.ROOT_TaskUsesExecutor/1, 
> where GetParam() = "docker,mesos"
> {noformat}
> {noformat:title=}
> ../../src/tests/default_executor_tests.cpp:98: Failure
> slave: Failed to create containerizer: Could not create DockerContainerizer: 
> Failed to create docker: Failed to get docker version: Failed to execute 
> 'docker -H unix:///var/run/docker.sock --version': exited with status 127
> {noformat}
> Maybe we can put {{DOCKER_}} in the instantiation name and use another 
> instantiation for tests that don't require docker?
> /cc [~vinodkone] [~anandmazumdar]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6420:
---
Shepherd: Jie Yu
Target Version/s: 1.0.2, 1.1.0, 1.2.0  (was: 1.1.0)

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
> Fix For: 1.0.2, 1.2.0
>
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol
> Extract from strace:
> ...
> [pid 57701] 19:14:02.493718 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494395 close(19)   = 0
> [pid 57701] 19:14:02.494448 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.494844 close(19)   = 0
> [pid 57701] 19:14:02.494913 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.495565 close(19)   = 0
> [pid 57701] 19:14:02.495617 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496072 close(19)   = 0
> [pid 57701] 19:14:02.496128 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.496758 close(19)   = 0
> [pid 57701] 19:14:02.496812 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497270 close(19)   = 0
> [pid 57701] 19:14:02.497319 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.497698 close(19)   = 0
> [pid 57701] 19:14:02.497750 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498407 close(19)   = 0
> [pid 57701] 19:14:02.498456 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.498899 close(19)   = 0
> [pid 57701] 19:14:02.498963 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 63682] 19:14:02.499091 close(18 
> [pid 57701] 19:14:02.499634 close(19)   = 0
> [pid 57701] 19:14:02.499689 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500044 close(19)   = 0
> [pid 57701] 19:14:02.500093 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.500734 close(19)   = 0
> [pid 57701] 19:14:02.500782 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.501271 close(19)   = 0
> [pid 57701] 19:14:02.501339 socket(PF_NETLINK, SOCK_RAW, 0) = 19
> [pid 57701] 19:14:02.502030 close(19)   = 0
> [pid 57701] 19:14:02.502101 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 19
> ...
> ...
> [pid 57691] 19:18:03.461022 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461345 open("/etc/selinux/config", O_RDONLY  ...>
> [pid 57691] 19:18:03.461460 close(27)   = 0
> [pid 57691] 19:18:03.461520 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.461632 close(3 
> [pid  6138] 19:18:03.461781 open("/proc/mounts", O_RDONLY 
> [pid  6138] 19:18:03.462190 close(3 
> [pid 57691] 19:18:03.462374 close(27)   = 0
> [pid 57691] 19:18:03.462430 socket(PF_NETLINK, SOCK_RAW, 0 
> [pid  6138] 19:18:03.462456 open("/proc/net/psched", O_RDONLY 
> [pid  6138] 19:18:03.462678 close(3 
> [pid  6138] 19:18:03.462915 open("/etc/libnl/classid", O_RDONLY  ...>
> [pid 57691] 19:18:03.463046 close(27)   = 0
> [pid 57691] 19:18:03.463111 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid  6138] 19:18:03.463225 close(3 
> [pid 57691] 19:18:03.463845 close(27)   = 0
> [pid 57691] 19:18:03.463911 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.464604 close(27)   = 0
> [pid 57691] 19:18:03.464664 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465074 close(27)   = 0
> [pid 57691] 19:18:03.465132 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.465862 close(27)   = 0
> [pid 57691] 19:18:03.465928 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.466713 close(27)   = 0
> [pid 57691] 19:18:03.466780 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.467472 close(27)   = 0
> [pid 57691] 19:18:03.467524 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468012 close(27)   = 0
> [pid 57691] 19:18:03.468075 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.468799 close(27)   = 0
> [pid 57691] 19:18:03.468950 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.469505 close(27)   = 0
> [pid 57691] 19:18:03.469578 socket(PF_NETLINK, SOCK_RAW, 0) = 27
> [pid 57691] 19:18:03.470301 
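
The lsof check from the report can be automated. A minimal sketch (hypothetical helper; assumes lsof's default output format) that counts descriptors lsof cannot classify, which is how the leaked sockets show up:

```python
def count_leaked_sockets(lsof_output):
    """Count fds that lsof reports as unidentifiable sockets.

    Sockets leaked by the agent show up in lsof as "can't identify
    protocol" entries, as in the report above.
    """
    return sum(
        1 for line in lsof_output.splitlines()
        if "can't identify protocol" in line
    )

# Sample lines adapted from the lsof output in the report.
sample = (
    "mesos-sla 57688 root 19u sock 0,6 0t0 2993216948 can't identify protocol\n"
    "mesos-sla 57688 root 27u sock 0,6 0t0 2993216468 can't identify protocol\n"
    "mesos-sla 57688 root  3u IPv4 12345 0t0 TCP *:5051 (LISTEN)\n"
)
print(count_leaked_sockets(sample))  # 2
```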

[jira] [Updated] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6420:
---
Priority: Blocker  (was: Major)

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>Priority: Blocker
> Fix For: 1.0.2, 1.2.0
>
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol

[jira] [Commented] (MESOS-6420) Mesos Agent leaking sockets when port mapping network isolator is ON

2016-10-25 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605995#comment-15605995
 ] 

Alexander Rukletsov commented on MESOS-6420:


[~jieyu] could you cherry-pick it to 1.1.x?

> Mesos Agent leaking sockets when port mapping network isolator is ON
> 
>
> Key: MESOS-6420
> URL: https://issues.apache.org/jira/browse/MESOS-6420
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation, network, slave
>Affects Versions: 1.0.2
>Reporter: Santhosh Shanmugham
>Priority: Blocker
> Fix For: 1.0.2, 1.2.0
>
>
> Mesos Agent leaks one socket per task launched and eventually runs out of 
> sockets. We were able to track it down to the network isolator 
> (port_mapping.cpp). When we turned off the port mapping isolator, no file 
> descriptors were leaked. The leaked fd is a SOCK_STREAM socket.
> Leaked Sockets:
> $ sudo lsof -p $(pgrep -u root -o -f /usr/local/sbin/mesos-slave) -nP | grep 
> "can't"
> [sudo] password for sshanmugham:
> mesos-sla 57688 root   19u  sock0,6  0t0 2993216948 can't 
> identify protocol
> mesos-sla 57688 root   27u  sock0,6  0t0 2993216468 can't 
> identify protocol

[jira] [Commented] (MESOS-2092) Make ACLs dynamic

2016-11-14 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15664132#comment-15664132
 ] 

Alexander Rukletsov commented on MESOS-2092:


Does not look like it. [~gradywang]?

> Make ACLs dynamic
> -
>
> Key: MESOS-2092
> URL: https://issues.apache.org/jira/browse/MESOS-2092
> Project: Mesos
>  Issue Type: Task
>  Components: security
>Reporter: Alexander Rukletsov
>Assignee: Yongqiao Wang
>  Labels: mesosphere, newbie
>
> Master loads ACLs once during its launch and there is no way to update them 
> in a running master. Making them dynamic will allow for updating ACLs on the 
> fly, for example granting a new framework necessary rights.
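
One common way to make such configuration dynamic is to re-read the ACL file whenever it changes on disk. A rough sketch under that assumption (hypothetical helper, not the Mesos implementation):

```python
import json
import os
import tempfile
import time

class AclReloader:
    """Reload an ACL file whenever its mtime changes.

    Hypothetical sketch, not the actual Mesos mechanism: the master would
    call get() on each authorization decision instead of caching ACLs
    once at launch.
    """

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._acls = {}

    def get(self):
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:  # file changed since last read: reload
            with open(self.path) as f:
                self._acls = json.load(f)
            self._mtime = mtime
        return self._acls

# Demo: update the file on disk and observe the reloaded ACLs.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    json.dump({"register_frameworks": []}, f)

reloader = AclReloader(path)
before = reloader.get()

with open(path, "w") as f:
    json.dump({"register_frameworks": [{"roles": ["ads"]}]}, f)
os.utime(path, (time.time() + 10, time.time() + 10))  # force an mtime change
after = reloader.get()
os.unlink(path)
```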





[jira] [Updated] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.

2016-11-24 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-1802:
---
 Shepherd: Alexander Rukletsov
   Sprint: Mesosphere Sprint 48
Affects Version/s: 0.27.3
   0.28.2
   1.0.0
   1.0.1
 Story Points: 5
 Target Version/s: 1.2.0
 Priority: Minor  (was: Major)

> HealthCheckTest.HealthStatusChange is flaky on jenkins.
> ---
>
> Key: MESOS-1802
> URL: https://issues.apache.org/jira/browse/MESOS-1802
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
>Affects Versions: 0.26.0, 0.27.3, 0.28.2, 1.0.0, 1.0.1
>Reporter: Benjamin Mahler
>Assignee: haosdent
>Priority: Minor
>  Labels: flaky, health-check, mesosphere
> Attachments: health_check_flaky_test_log.txt
>
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull
> {noformat}
> [ RUN  ] HealthCheckTest.HealthStatusChange
> Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2'
> I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms
> I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns
> I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns
> I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 642ns
> I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 343ns
> I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery
> I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status
> I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to 
> STARTING
> I0916 22:56:14.036603 21046 master.cpp:286] Master 
> 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 
> 67.195.81.186:47865
> I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials'
> I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 480322ns
> I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to 
> STARTING
> I0916 22:56:14.036769 21046 master.cpp:366] Authorization enabled
> I0916 22:56:14.036826 21045 recover.cpp:451] Replica is in STARTING status
> I0916 22:56:14.036944 21052 master.cpp:120] No whitelist given. Advertising 
> offers for all slaves
> I0916 22:56:14.036968 21049 hierarchical_allocator_process.hpp:299] 
> Initializing hierarchical allocator process with master : 
> master@67.195.81.186:47865
> I0916 22:56:14.037284 21054 replica.cpp:638] Replica in STARTING status 
> received a broadcasted recover request
> I0916 22:56:14.037312 21046 master.cpp:1212] The newly elected leader is 
> master@67.195.81.186:47865 with id 20140916-225614-3125920579-47865-21026
> I0916 22:56:14.037333 21046 master.cpp:1225] Elected as the leading master!
> I0916 22:56:14.037345 21046 master.cpp:1043] Recovering from registrar
> I0916 22:56:14.037504 21040 registrar.cpp:313] Recovering registrar
> I0916 22:56:14.037505 21053 recover.cpp:188] Received a recover response from 
> a replica in STARTING status
> I0916 22:56:14.037681 21047 recover.cpp:542] Updating replica status to VOTING
> I0916 22:56:14.038072 21052 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 330251ns
> I0916 22:56:14.038087 21052 replica.cpp:320] Persisted replica status to 
> VOTING
> I0916 22:56:14.038127 21053 recover.cpp:556] Successfully joined the Paxos 
> group
> I0916 22:56:14.038202 21053 recover.cpp:440] Recover process terminated
> I0916 22:56:14.038364 21048 log.cpp:656] Attempting to start the writer
> I0916 22:56:14.038812 21053 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I0916 22:56:14.038925 21053 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 92623ns
> I0916 22:56:14.038944 21053 replica.cpp:342] Persisted promised to 1
> I0916 22:56:14.039201 21052 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0916 22:56:14.039676 21047 

[jira] [Commented] (MESOS-1802) HealthCheckTest.HealthStatusChange is flaky on jenkins.

2016-11-24 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694334#comment-15694334
 ] 

Alexander Rukletsov commented on MESOS-1802:


I can reproduce it relatively easily by running _parallel_ {{make check}}. Here 
is a fresh log:
{noformat}
[ RUN  ] HealthCheckTest.HealthStatusChange
I1124 23:20:48.351884 4284416 exec.cpp:162] Version: 1.2.0
I1124 23:20:48.375592 3747840 exec.cpp:237] Executor registered on agent 
6db7ef4d-7211-47be-98ba-ad590b528c69-S0
Received SUBSCRIBED event
Subscribed executor on alexr.speedportneo09012801000249
Received LAUNCH event
Starting task 1
/Users/alex/Projects/mesos/build/parallel/src/mesos-containerizer launch 
--command="{"shell":true,"value":"sleep 120"}" --help="false"
Forked command at 73286
Received task health update, healthy: true
rm: /private/tmp/z1PbfH/rG7Gha: No such file or directory
W1124 23:20:48.544631 3211264 health_checker.cpp:245] Health check failed 1 
times consecutively: COMMAND health check failed: Command returned exited with 
status 1
Received task health update, healthy: false
Received task health update, healthy: true
rm: /private/tmp/z1PbfH/rG7Gha: No such file or directory
../../../src/tests/health_check_tests.cpp:790: Failure
Value of: (find).get()
  Actual: 16-byte object <05-00 00-00 00-00 00-00 60-A9 62-1B B2-7F 00-00>
Expected: false
Which is: false
I1124 23:20:48.732457 4284416 exec.cpp:414] Executor asked to shutdown
Received SHUTDOWN event
Shutting down
Sending SIGTERM to process tree at pid 73286
W1124 23:20:48.747885 1064960 health_checker.cpp:245] Health check failed 1 
times consecutively: COMMAND health check failed: Command returned exited with 
status 1
rm: /private/tmp/z1PbfH/rG7Gha: No such file or directory
W1124 23:20:48.948562 3747840 health_checker.cpp:245] Health check failed 1 
times consecutively: COMMAND health check failed: Command returned exited with 
status 1
Sent SIGTERM to the following process trees:
[ 
--- 73286 sleep 120
]
Scheduling escalation to SIGKILL in 3secs from now
[  FAILED  ] HealthCheckTest.HealthStatusChange (1639 ms)
{noformat}

These lines
{noformat}
Received task health update, healthy: true
rm: /private/tmp/z1PbfH/rG7Gha: No such file or directory
../../../src/tests/health_check_tests.cpp:790: Failure
{noformat}
obviously hint that we've queried the HTTP endpoint _after_ the next health 
status change.
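
The usual fix for this kind of race is to poll the endpoint until the expected status is observed, rather than asserting on a single snapshot. A minimal sketch with hypothetical helper names:

```python
import time

def wait_for_health(query, expected, timeout=5.0, interval=0.05):
    """Poll query() until it returns `expected` or the timeout expires.

    Asserting on a single snapshot races with the next health transition,
    which is exactly the failure mode in the log above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if query() == expected:
            return True
        time.sleep(interval)
    return False

# Demo: a task whose reported health flips to unhealthy on the third poll.
polls = iter([True, True, False])
print(wait_for_health(lambda: next(polls), expected=False))  # True
```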

> HealthCheckTest.HealthStatusChange is flaky on jenkins.
> ---
>
> Key: MESOS-1802
> URL: https://issues.apache.org/jira/browse/MESOS-1802
> Project: Mesos
>  Issue Type: Bug
>  Components: test, tests
>Affects Versions: 0.26.0
>Reporter: Benjamin Mahler
>Assignee: haosdent
>  Labels: flaky, health-check, mesosphere
> Attachments: health_check_flaky_test_log.txt
>
>
> https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2374/consoleFull
> {noformat}
> [ RUN  ] HealthCheckTest.HealthStatusChange
> Using temporary directory '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2'
> I0916 22:56:14.034612 21026 leveldb.cpp:176] Opened db in 2.155713ms
> I0916 22:56:14.034965 21026 leveldb.cpp:183] Compacted db in 332489ns
> I0916 22:56:14.034984 21026 leveldb.cpp:198] Created db iterator in 3710ns
> I0916 22:56:14.034996 21026 leveldb.cpp:204] Seeked to beginning of db in 
> 642ns
> I0916 22:56:14.035006 21026 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 343ns
> I0916 22:56:14.035023 21026 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0916 22:56:14.035200 21054 recover.cpp:425] Starting replica recovery
> I0916 22:56:14.035403 21041 recover.cpp:451] Replica is in EMPTY status
> I0916 22:56:14.035888 21045 replica.cpp:638] Replica in EMPTY status received 
> a broadcasted recover request
> I0916 22:56:14.035969 21052 recover.cpp:188] Received a recover response from 
> a replica in EMPTY status
> I0916 22:56:14.036118 21042 recover.cpp:542] Updating replica status to 
> STARTING
> I0916 22:56:14.036603 21046 master.cpp:286] Master 
> 20140916-225614-3125920579-47865-21026 (penates.apache.org) started on 
> 67.195.81.186:47865
> I0916 22:56:14.036634 21046 master.cpp:332] Master only allowing 
> authenticated frameworks to register
> I0916 22:56:14.036648 21046 master.cpp:337] Master only allowing 
> authenticated slaves to register
> I0916 22:56:14.036659 21046 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HealthCheckTest_HealthStatusChange_IYnlu2/credentials'
> I0916 22:56:14.036686 21045 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 480322ns
> I0916 22:56:14.036700 21045 replica.cpp:320] Persisted replica status to 
> STARTING
> I0916 22:56:14.036769 21046 

[jira] [Updated] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-11-24 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6002:
---
Target Version/s: 1.2.0

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>Assignee: Qian Zhang
>  Labels: aufs, backend, containerizer
> Fix For: 1.1.1
>
> Attachments: whiteout.diff
>
>
> The whiteout file is not removed correctly when using the aufs backend in 
> unified containerizer. It can be verified by this unit test with the aufs 
> manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004930 24314 master.cpp:441] 
> Master only allowing authenticated agents to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004935 24314 master.cpp:454] 
> Master only allowing authenticated HTTP frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004942 24314 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/0z753P/credentials'
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005018 24314 master.cpp:499] Using 
> default 'crammd5' authenticator
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005101 24314 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005152 24314 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [20:11:25]W: 

[jira] [Updated] (MESOS-6360) The handling of whiteout files in provisioner is not correct.

2016-11-24 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6360:
---
Target Version/s: 1.2.0  (was: 1.1.1)
   Fix Version/s: 1.2.0
  1.1.1

> The handling of whiteout files in provisioner is not correct.
> -
>
> Key: MESOS-6360
> URL: https://issues.apache.org/jira/browse/MESOS-6360
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Blocker
> Fix For: 1.1.1, 1.2.0
>
>
> Currently, when a user launches a container from a Docker image via the 
> universal containerizer, we always handle the whiteout files in 
> {{ProvisionerProcess::__provision()}} regardless of which backend is used.
> However, this is actually not correct, because the way whiteout files are 
> handled is backend-dependent; different backends need to handle them in 
> different ways, e.g.:
> * AUFS backend: It seems the AUFS whiteout ({{.wh.}} and 
> {{.wh..wh..opq}}) is the whiteout standard in Docker (see [this comment | 
> https://github.com/docker/docker/blob/v1.12.1/pkg/archive/archive.go#L259:L262]
>  for details), so after the Docker image is pulled, its whiteout files in the 
> store are already in aufs format; we then do not need to do anything about 
> whiteout handling, because the aufs mount done in 
> {{AufsBackendProcess::provision()}} will handle it automatically.
> * Overlay backend: Overlayfs has its own whiteout files (see [this doc | 
> https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt] for 
> details), so we need to convert the aufs whiteout files to overlayfs whiteout 
> files before we do the overlay mount in {{OverlayBackendProcess::provision}} 
> which will automatically handle the overlayfs whiteout files.
> * Copy backend: We need to manually handle the aufs whiteout files when we 
> copy each layer in {{CopyBackendProcess::_provision()}}.
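
For the copy backend case, applying an aufs whiteout means deleting the shadowed entry instead of copying the marker file. A rough sketch of that idea (hypothetical helper, not the actual {{CopyBackendProcess}} code), using the {{.wh.}} prefix and {{.wh..wh..opq}} opaque marker described above:

```python
import os
import shutil

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"

def apply_layer(rootfs, layer):
    """Copy one image layer onto rootfs, honoring aufs whiteout files."""
    for dirpath, dirnames, filenames in os.walk(layer):
        rel = os.path.relpath(dirpath, layer)
        target_dir = os.path.join(rootfs, rel) if rel != "." else rootfs
        os.makedirs(target_dir, exist_ok=True)

        if OPAQUE_MARKER in filenames:
            # Opaque directory: hide everything from lower layers.
            for entry in os.listdir(target_dir):
                path = os.path.join(target_dir, entry)
                shutil.rmtree(path) if os.path.isdir(path) else os.remove(path)

        for name in filenames:
            if name == OPAQUE_MARKER:
                continue
            if name.startswith(WHITEOUT_PREFIX):
                # Whiteout: delete the shadowed entry instead of copying.
                victim = os.path.join(target_dir, name[len(WHITEOUT_PREFIX):])
                if os.path.isdir(victim):
                    shutil.rmtree(victim)
                elif os.path.exists(victim):
                    os.remove(victim)
                continue
            shutil.copy2(os.path.join(dirpath, name),
                         os.path.join(target_dir, name))
```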




