[jira] [Created] (MESOS-8734) Restore `WaitAfterDestroy` test to check termination status of a terminated nested container.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8734:


 Summary: Restore `WaitAfterDestroy` test to check termination 
status of a terminated nested container.
 Key: MESOS-8734
 URL: https://issues.apache.org/jira/browse/MESOS-8734
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


It's important to check that after termination of a nested container, its 
termination status is available. This property is relied upon by the default 
executor.

Right now, if we remove [this section of 
code|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/mesos/containerizer.cpp#L2093-L2111],
 no test will break!

https://reviews.apache.org/r/65505
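
For illustration, a rough sketch of the check such a test could restore 
(assuming a nested-container test fixture with a running `containerizer` and a 
`nestedContainerId`; the names and setup are illustrative, not the actual 
`WaitAfterDestroy` test):
{code:java}
// Hedged sketch: after a nested container is destroyed, `wait()` must still
// report its termination, including the exit status.
Future<Option<ContainerTermination>> wait =
  containerizer->wait(nestedContainerId);

AWAIT_READY(containerizer->destroy(nestedContainerId));

AWAIT_READY(wait);
ASSERT_SOME(wait.get());

// The termination status of the destroyed nested container must be available.
EXPECT_TRUE(wait->get().has_status());
{code}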



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8732) Use composing containerizer by default in tests.

2018-03-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8732:


 Summary: Use composing containerizer by default in tests.
 Key: MESOS-8732
 URL: https://issues.apache.org/jira/browse/MESOS-8732
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik


If we assign "docker,mesos" to the `containerizers` flag for an agent, then 
`ComposingContainerizer` will be used by the many tests that do not specify the 
`containerizers` flag themselves. That's the goal of this task.

I tried to do that by adding [`flags.containerizers = 
"docker,mesos";`|https://github.com/apache/mesos/blob/master/src/tests/mesos.cpp#L273],
 but it turned out that some tests started to hang due to paused clocks, since 
the docker c'zer and the docker library rely on libprocess clocks.
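
A hedged sketch of the attempted change in the common test setup (based on the 
linked `src/tests/mesos.cpp` location; the clock comment summarizes the 
observed hang, not a verified fix):
{code:java}
// In the agent flags used by tests that don't set `containerizers` themselves
// (sketch): exercise the composing containerizer by default.
flags.containerizers = "docker,mesos";

// Caveat observed above: the docker c'zer and the docker library rely on
// libprocess clocks, so tests that pause the clock may hang with this default.
{code}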



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8729) Libprocess: deadlock in process::finalize

2018-03-23 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8729:


 Summary: Libprocess: deadlock in process::finalize
 Key: MESOS-8729
 URL: https://issues.apache.org/jira/browse/MESOS-8729
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 1.6.0
 Environment: The issue has been reproduced on Ubuntu 16.04 on the mesos 
master branch, commit `42848653b2`. 
Reporter: Andrei Budnik
 Attachments: deadlock.txt

Since we are calling 
[`libprocess::finalize()`|https://github.com/apache/mesos/blob/02ebf9986ab5ce883a71df72e9e3392a3e37e40e/src/slave/containerizer/mesos/io/switchboard_main.cpp#L157]
 before returning from the IOSwitchboard's main function, we expect that all 
http responses are sent back to clients before the IOSwitchboard terminates. 
However, after [adding|https://reviews.apache.org/r/66147/] 
`libprocess::finalize()` we have seen that the IOSwitchboard might get stuck in 
`libprocess::finalize()`. See the attached stack trace.
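
A minimal standalone sketch of the pattern in question (not the actual 
IOSwitchboard code; it only shows where the call sits relative to the HTTP 
traffic):
{code:java}
#include <process/process.hpp>

int main(int argc, char** argv)
{
  process::initialize();

  // ... spawn the switchboard process and serve the HTTP requests ...

  // Expectation: all pending HTTP responses are flushed to clients before
  // this returns. In the observed runs the process can block here forever;
  // see the attached stack trace.
  process::finalize(true);

  return 0;
}
{code}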



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8714) Cleanup `containers_` hashmap once container exits

2018-03-22 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16410015#comment-16410015
 ] 

Andrei Budnik commented on MESOS-8714:
--

The composing c'zer 
[subscribes|https://github.com/apache/mesos/blob/5b655ce062ff55cdefed119d97ad923aeeb2efb5/src/slave/containerizer/composing.cpp#L356-L357]
 to container termination after a successful launch, so we always clean up this 
hash map.
After changes to the composing c'zer, this invariant (that we always clean up 
terminated containers) should remain unchanged.
I think there should be only one place where we do the cleanup: 
`ComposingContainerizerProcess::_launch`.
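
For illustration, a hedged sketch of that single cleanup point in 
`ComposingContainerizerProcess::_launch` (simplified; `containerizer` stands 
for the underlying containerizer that launched the container, and the real code 
around the linked lines may differ):
{code:java}
// On a successful launch, subscribe to the container's termination exactly
// once and erase it from the `containers_` hash map when it terminates.
containerizer->wait(containerId)
  .onAny(defer(self(), [=](const Future<Option<ContainerTermination>>&) {
    if (containers_.contains(containerId)) {
      delete containers_.at(containerId);
      containers_.erase(containerId);
    }
  }));
{code}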

> Cleanup `containers_` hashmap once container exits
> --
>
> Key: MESOS-8714
> URL: https://issues.apache.org/jira/browse/MESOS-8714
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>
> To clean up the `containers_` hash map in the composing c'zer, we need to 
> subscribe to a container termination event in the `_launch` method. Also, it's 
> desirable to limit the number of places where we do the cleanup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8714) Cleanup `containers_` hashmap once container exits

2018-03-21 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8714:


 Summary: Cleanup `containers_` hashmap once container exits
 Key: MESOS-8714
 URL: https://issues.apache.org/jira/browse/MESOS-8714
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


To clean up the `containers_` hash map in the composing c'zer, we need to 
subscribe to a container termination event in the `_launch` method. Also, it's 
desirable to limit the number of places where we do the cleanup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8713) Synchronize result of `wait` and `destroy` composing c'zer methods

2018-03-21 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8713:


 Summary: Synchronize result of `wait` and `destroy` composing 
c'zer methods
 Key: MESOS-8713
 URL: https://issues.apache.org/jira/browse/MESOS-8713
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


Make sure both the `wait` and `destroy` methods always return the same result.
For example, if we call `destroy` for a terminated nested container, the 
composing c'zer returns `false`/`None`, while the `wait` method calls `wait` for 
the parent container, which might read the container termination status from a 
file. We probably need to implement a test for this case, if one doesn't exist.
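
A rough sketch of such a test (illustrative only; it assumes the unified return 
type proposed in MESOS-8706 and a nested container that has already terminated):
{code:java}
// Both calls should observe the same termination for a terminated nested
// container, rather than `destroy` returning `false`/`None` while `wait`
// recovers the status from the runtime directory.
Future<Option<ContainerTermination>> wait =
  containerizer->wait(nestedContainerId);

Future<Option<ContainerTermination>> destroy =
  containerizer->destroy(nestedContainerId);

AWAIT_READY(wait);
AWAIT_READY(destroy);

ASSERT_SOME(wait.get());
ASSERT_SOME(destroy.get());
EXPECT_EQ(wait->get().status(), destroy->get().status());
{code}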



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8712) Remove `destroyed` promise from `Container` struct

2018-03-21 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8712:


 Summary: Remove `destroyed` promise from `Container` struct
 Key: MESOS-8712
 URL: https://issues.apache.org/jira/browse/MESOS-8712
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


The [`destroyed` 
promise|https://github.com/apache/mesos/blob/5d8a9c1b77f96151da859b4c0c3607d22c36cd18/src/slave/containerizer/composing.cpp#L138]
 is not needed anymore, since we can use the property that the `wait` and 
`destroy` methods depend on the same container termination promise. This change 
should affect only the composing c'zer.
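
A greatly simplified sketch of the direction (not the actual patch; it assumes 
the unified return type from MESOS-8706, and the real `destroy` also has to 
handle containers that are still being launched):
{code:java}
// With `wait` and `destroy` backed by the same termination promise in the
// underlying containerizer, the composing containerizer can forward that
// future directly instead of keeping a per-container `destroyed` promise.
Future<Option<ContainerTermination>> ComposingContainerizerProcess::destroy(
    const ContainerID& containerId)
{
  if (!containers_.contains(containerId)) {
    // Unknown or already cleaned-up container.
    return None();
  }

  return containers_.at(containerId)->containerizer->destroy(containerId);
}
{code}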



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8710) Update tests after changing return type of `wait` method

2018-03-21 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8710:


 Summary: Update tests after changing return type of `wait` method
 Key: MESOS-8710
 URL: https://issues.apache.org/jira/browse/MESOS-8710
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


Changing the return type of the `wait` methods requires corresponding changes 
in tests. Here is the known list of test source files that need to be updated:
{code:java}
src/tests/containerizer.{hpp,cpp}
src/tests/containerizer/composing_containerizer_tests.cpp
src/tests/containerizer/docker_containerizer_tests.cpp
src/tests/containerizer/io_switchboard_tests.cpp
src/tests/containerizer/mesos_containerizer_tests.cpp
src/tests/containerizer/mock_containerizer.hpp{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8706) Unify return type of `wait` and `destroy` containerizer methods

2018-03-21 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8706:


 Summary: Unify return type of `wait` and `destroy` containerizer 
methods
 Key: MESOS-8706
 URL: https://issues.apache.org/jira/browse/MESOS-8706
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


We want to unify the return type of both the `destroy` and `wait` methods of a 
containerizer, because they depend on the same container termination promise in 
our built-in mesos and docker containerizers. That gives us an opportunity to 
simplify the launch and destroy logic in the composing containerizer.

The return type of the `destroy()` method should be changed from:
{code:java}
Future<bool> destroy(const ContainerID& containerId);
{code}
to
{code:java}
Future<Option<ContainerTermination>> destroy(const ContainerID& containerId);
{code}

[jira] [Created] (MESOS-8705) Composing containerizer improvements

2018-03-21 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8705:


 Summary: Composing containerizer improvements
 Key: MESOS-8705
 URL: https://issues.apache.org/jira/browse/MESOS-8705
 Project: Mesos
  Issue Type: Epic
  Components: containerization
Reporter: Andrei Budnik


This epic is meant to collect composing containerizer-related issues and 
improvements.
The goals are to simplify the composing containerizer, cover it with tests, and 
clarify and improve the documentation for the containerizer interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8258) Mesos.DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer is flaky.

2018-03-20 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8258:


Assignee: Andrei Budnik

> Mesos.DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer is flaky.
> --
>
> Key: MESOS-8258
> URL: https://issues.apache.org/jira/browse/MESOS-8258
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
> Ubuntu 17.04
> Debian 9
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_DOCKER_SlaveRecoveryTaskContainer-badrun.txt, 
> ROOT_DOCKER_SlaveRecoveryTaskContainer-badrun2.txt, 
> ROOT_DOCKER_SlaveRecoveryTaskContainer-goodrun.txt
>
>
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/containerizer/docker_containerizer_tests.cpp:2772
>   Expected: 1
> To be equal to: reregister.updates_size()
>   Which is: 2
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8258) Mesos.DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer is flaky.

2018-03-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406290#comment-16406290
 ] 

Andrei Budnik edited comment on MESOS-8258 at 3/20/18 4:03 PM:
---

Steps to reproduce:
 1. Add `::sleep(2);` at 
[https://github.com/apache/mesos/blob/2c41b039da20cf3904cee4272673d9e3791f5184/src/tests/containerizer/docker_containerizer_tests.cpp#L2745|https://github.com/apache/mesos/blob/2c41b039da20cf3904cee4272673d9e3791f5184/src/tests/containerizer/docker_containerizer_tests.cpp#L2743]
 2. Recompile and run `sudo make check 
GTEST_FILTER=DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer`


was (Author: abudnik):
Steps to reproduce:
1. Add `::sleep(2);` at 
[https://github.com/apache/mesos/blob/2c41b039da20cf3904cee4272673d9e3791f5184/src/tests/containerizer/docker_containerizer_tests.cpp#L2743]
2. Recompile and run `sudo make check 
GTEST_FILTER=DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer`

> Mesos.DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer is flaky.
> --
>
> Key: MESOS-8258
> URL: https://issues.apache.org/jira/browse/MESOS-8258
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
> Ubuntu 17.04
> Debian 9
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_DOCKER_SlaveRecoveryTaskContainer-badrun.txt, 
> ROOT_DOCKER_SlaveRecoveryTaskContainer-badrun2.txt, 
> ROOT_DOCKER_SlaveRecoveryTaskContainer-goodrun.txt
>
>
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/containerizer/docker_containerizer_tests.cpp:2772
>   Expected: 1
> To be equal to: reregister.updates_size()
>   Which is: 2
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8258) Mesos.DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer is flaky.

2018-03-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16406290#comment-16406290
 ] 

Andrei Budnik commented on MESOS-8258:
--

Steps to reproduce:
1. Add `::sleep(2);` at 
[https://github.com/apache/mesos/blob/2c41b039da20cf3904cee4272673d9e3791f5184/src/tests/containerizer/docker_containerizer_tests.cpp#L2743]
2. Recompile and run `sudo make check 
GTEST_FILTER=DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer`

> Mesos.DockerContainerizerTest.ROOT_DOCKER_SlaveRecoveryTaskContainer is flaky.
> --
>
> Key: MESOS-8258
> URL: https://issues.apache.org/jira/browse/MESOS-8258
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
> Ubuntu 17.04
> Debian 9
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: ROOT_DOCKER_SlaveRecoveryTaskContainer-badrun.txt, 
> ROOT_DOCKER_SlaveRecoveryTaskContainer-badrun2.txt, 
> ROOT_DOCKER_SlaveRecoveryTaskContainer-goodrun.txt
>
>
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-17.04/mesos/src/tests/containerizer/docker_containerizer_tests.cpp:2772
>   Expected: 1
> To be equal to: reregister.updates_size()
>   Which is: 2
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-03-19 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405054#comment-16405054
 ] 

Andrei Budnik commented on MESOS-8545:
--

Steps to reproduce:
 1. Add `::sleep(1)` before [sending http 
response|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1741]
 to a socket.
 2. Recompile and run: `make check 
GTEST_FILTER=ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession/0`

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> ---
>
> Key: MESOS-8545
> URL: https://issues.apache.org/jira/browse/MESOS-8545
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: Mesosphere, flaky-test
> Attachments: 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, 
> AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server 
> Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596:
>  Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-5886) FUTURE_DISPATCH may react on irrelevant dispatch.

2018-03-12 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395580#comment-16395580
 ] 

Andrei Budnik commented on MESOS-5886:
--

If a class contains methods with the same signature, we can add one extra 
argument of a unique type for each of these methods. We can introduce a special 
type `Tag` for extra arguments:
{code:java}
struct None {};

template <int N>
struct Tag { Tag(const None&) {} };

class Example : public Process<Example> {
public:
  void m1(int, Tag<1> tag = None()) {}
  void m2(int, Tag<2> tag = None()) {}
  void m3(int, Tag<3> tag = None()) {}
};{code}
It looks like the current implementation of 
[`dispatch()`|https://github.com/apache/mesos/blob/8adb5fcb1f6c451bc9ad7ecdc6e39bc170fdcd65/3rdparty/libprocess/include/process/dispatch.hpp#L209-L256]
 supports arguments with a default value, so a user of `dispatch()` doesn't 
need to care about the extra tag argument.
Note that this approach fixes the issue only for non-virtual methods. For more 
details, see the [On fixing 
FUTURE_DISPATCH|https://docs.google.com/document/d/1d4nYfIWuTGtvHObolyzCwBttU75gxbKSvyISjHQ76Nw]
 document.
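
For illustration, a caller-side sketch under the assumption above (i.e. that 
`dispatch()` tolerates the defaulted tag argument as described, and that an 
`Example` process has been spawned as `pid`):
{code:java}
// The distinct, defaulted tag types make &Example::m1 and &Example::m2
// different function-pointer types, so FUTURE_DISPATCH can no longer match
// a dispatch of one against an expectation on the other.
Future<Nothing> future = FUTURE_DISPATCH(pid, &Example::m1);

dispatch(pid, &Example::m1, 42);  // the tag is not passed explicitly

AWAIT_READY(future);
{code}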

 

> FUTURE_DISPATCH may react on irrelevant dispatch.
> -
>
> Key: MESOS-5886
> URL: https://issues.apache.org/jira/browse/MESOS-5886
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.2, 1.2.1, 1.3.0, 1.4.0
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: mesosphere, tech-debt, tech-debt-test
>
> [{{FUTURE_DISPATCH}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L50]
>  uses 
> [{{DispatchMatcher}}|https://github.com/apache/mesos/blob/e8ebbe5fe4189ef7ab046da2276a6abee41deeb2/3rdparty/libprocess/include/process/gmock.hpp#L350]
>  to figure out whether a processed {{DispatchEvent}} is the one the user is 
> waiting for. However, comparing the {{std::type_info}} of function pointers is 
> not enough: different class methods with the same signatures will be matched. 
> Here is a test that proves this:
> {noformat}
> class DispatchProcess : public Process<DispatchProcess>
> {
> public:
>   MOCK_METHOD0(func0, void());
>   MOCK_METHOD1(func1, bool(bool));
>   MOCK_METHOD1(func1_same_but_different, bool(bool));
>   MOCK_METHOD1(func2, Future<bool>(bool));
>   MOCK_METHOD1(func3, int(int));
>   MOCK_METHOD2(func4, Future<bool>(bool, int));
> };
> {noformat}
> {noformat}
> TEST(ProcessTest, DispatchMatch)
> {
>   DispatchProcess process;
>   PID<DispatchProcess> pid = spawn(&process);
>
>   Future<Nothing> future = FUTURE_DISPATCH(
>       pid,
>       &DispatchProcess::func1_same_but_different);
>
>   EXPECT_CALL(process, func1(_))
>     .WillOnce(ReturnArg<0>());
>
>   dispatch(pid, &DispatchProcess::func1, true);
>
>   AWAIT_READY(future);
>
>   terminate(pid);
>   wait(pid);
> }
> {noformat}
> The test passes:
> {noformat}
> [ RUN  ] ProcessTest.DispatchMatch
> [   OK ] ProcessTest.DispatchMatch (1 ms)
> {noformat}
> This change was introduced in https://reviews.apache.org/r/28052/.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8605) Terminal task status update will not send if 'docker inspect' is hung

2018-02-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8605:


Assignee: Andrei Budnik

> Terminal task status update will not send if 'docker inspect' is hung
> -
>
> Key: MESOS-8605
> URL: https://issues.apache.org/jira/browse/MESOS-8605
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
>
> When the agent processes a terminal status update for a task, it calls 
> {{containerizer->update()}} on the container before it forwards the update: 
> https://github.com/apache/mesos/blob/9635d4a2d12fc77935c3d5d166469258634c6b7e/src/slave/slave.cpp#L5509-L5514
> In the Docker containerizer, {{update()}} calls {{Docker::inspect()}}, which 
> means that if the inspect call hangs, the terminal update will not be sent: 
> https://github.com/apache/mesos/blob/9635d4a2d12fc77935c3d5d166469258634c6b7e/src/slave/containerizer/docker.cpp#L1714



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8596) Add Docker inspect timeout when collecting container statistics

2018-02-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8596:


Assignee: (was: Andrei Budnik)

> Add Docker inspect timeout when collecting container statistics
> ---
>
> Key: MESOS-8596
> URL: https://issues.apache.org/jira/browse/MESOS-8596
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: mesosphere
>
> In cases where the Docker daemon is hung, many processes calling {{docker 
> inspect}} may accumulate due to repeated attempts to collect the container 
> statistics of Docker containers. We should add a timeout to these 
> {{Docker::inspect()}} calls to avoid this accumulation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8596) Add Docker inspect timeout when collecting container statistics

2018-02-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8596:


Assignee: Andrei Budnik

> Add Docker inspect timeout when collecting container statistics
> ---
>
> Key: MESOS-8596
> URL: https://issues.apache.org/jira/browse/MESOS-8596
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
>
> In cases where the Docker daemon is hung, many processes calling {{docker 
> inspect}} may accumulate due to repeated attempts to collect the container 
> statistics of Docker containers. We should add a timeout to these 
> {{Docker::inspect()}} calls to avoid this accumulation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-22 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370135#comment-16370135
 ] 

Andrei Budnik edited comment on MESOS-8574 at 2/22/18 8:32 PM:
---

https://reviews.apache.org/r/65713/
https://reviews.apache.org/r/65759/


was (Author: abudnik):
[https://reviews.apache.org/r/65713/
https://reviews.apache.org/r/65759/
|https://reviews.apache.org/r/65713/]

> Docker executor makes no progress when 'docker inspect' hangs
> -
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker, executor
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
>
> In the Docker executor, many calls later in the executor's lifecycle are 
> gated on an initial {{docker inspect}} call returning: 
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes 
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but 
> we must be careful of the case in which the executor's Docker container is 
> actually running successfully, but the Docker daemon is unresponsive. In such 
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's 
> container is running successfully.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8578) `UpgradeTest.UpgradeAgentIntoHierarchicalRoleForNonHierarchicalRole` is flaky

2018-02-13 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8578:


 Summary: 
`UpgradeTest.UpgradeAgentIntoHierarchicalRoleForNonHierarchicalRole` is flaky
 Key: MESOS-8578
 URL: https://issues.apache.org/jira/browse/MESOS-8578
 Project: Mesos
  Issue Type: Bug
 Environment: Debian 9 SSL GRPS
Reporter: Andrei Budnik
 Attachments: 
UpgradeTest.UpgradeAgentIntoHierarchicalRoleForNonHierarchicalRole-badrun.txt

{code:java}
../../src/tests/upgrade_tests.cpp:664
Failed to wait 15secs for offers
{code}
See logs in attachments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8577) Destroy nested container if `LAUNCH_NESTED_CONTAINER_SESSION` fails

2018-02-13 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8577:


Assignee: Andrei Budnik

> Destroy nested container if `LAUNCH_NESTED_CONTAINER_SESSION` fails
> ---
>
> Key: MESOS-8577
> URL: https://issues.apache.org/jira/browse/MESOS-8577
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
>
> Currently, if `attachContainerOutput()` 
> [fails|https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/slave/http.cpp#L3550-L3552]
>  for a `LAUNCH_NESTED_CONTAINER_SESSION` call, then we return an HTTP 500 error 
> to the client, but we don't destroy the nested container.
>  However, we 
> [destroy|https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/slave/http.cpp#L3607-L3612]
>  a nested container if `attachContainerOutput()` returns a failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8577) Destroy nested container if `LAUNCH_NESTED_CONTAINER_SESSION` fails

2018-02-13 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8577:


 Summary: Destroy nested container if 
`LAUNCH_NESTED_CONTAINER_SESSION` fails
 Key: MESOS-8577
 URL: https://issues.apache.org/jira/browse/MESOS-8577
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik


Currently, if `attachContainerOutput()` 
[fails|https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/slave/http.cpp#L3550-L3552]
 for a `LAUNCH_NESTED_CONTAINER_SESSION` call, then we return an HTTP 500 error 
to the client, but we don't destroy the nested container.
 However, we 
[destroy|https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/slave/http.cpp#L3607-L3612]
 a nested container if `attachContainerOutput()` returns a failure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`

2018-02-12 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361235#comment-16361235
 ] 

Andrei Budnik commented on MESOS-8550:
--

[~bennoe] Did you find a shepherd for this?

> Bug in `Master::detected()` leads to coredump in 
> `MasterZooKeeperTest.MasterInfoAddress`
> 
>
> Key: MESOS-8550
> URL: https://issues.apache.org/jira/browse/MESOS-8550
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master
>Reporter: Andrei Budnik
>Assignee: Benno Evers
>Priority: Major
> Attachments: MasterZooKeeperTest.MasterInfoAddress-badrun.txt
>
>
> {code:java}
> 15:55:17 Assertion failed: (isSome()), function get, file 
> ../../3rdparty/stout/include/stout/option.hpp, line 119.
> 15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if 
> you are using GNU date ***
> 15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill
> 15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID 
> 0x70427000) stack trace: ***
> 15:55:17 @ 0x7fff4fa24f5a _sigtramp
> 15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session 
> expired
> 15:55:17 @ 0x70425500 (unknown)
> 15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: 
> Client 
> environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU
> 15:55:17 @ 0x7fff4f84f312 abort
> 15:55:17 2018-02-07 
> 07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating 
> client connection, host=127.0.0.1:52197 sessionTimeout=1 
> watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 
> flags=0
> 15:55:17 @ 0x7fff4f817368 __assert_rtn
> 15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv
> 15:55:17 @0x10bbb04b5 Option<>::operator->()
> 15:55:17 @0x10bd4514a mesos::internal::master::Master::detected()
> 15:55:17 @0x10bf54558 
> _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_
> 15:55:17 @0x10bf54310 
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_
> 15:55:17 @0x10bf542bb 
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_
> 15:55:17 @0x10bf541f3 
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_
> 15:55:17 @0x10bf540bd 
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS4_6FutureI6OptionINS6_10MasterInfoSG_EEvRKNS4_3PIDIT_EEMSI_FvT0_EOT1_EUlOSE_PNS4_11ProcessBaseEE_JSE_NSt3__112placeholders4__phILi1EEJST_EEEDTclclsr3stdE7forwardISI_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSI_DpOS10_
> 15:55:17 @0x10bf54081 
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS5_6FutureI6OptionINS7_10MasterInfoSH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_PNS5_11ProcessBaseEE_JSF_NSt3__112placeholders4__phILi1EEJSU_EEEvOSJ_DpOT0_
> 15:55:17 @0x10bf53e06 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINSA_10MasterInfoSK_EEvRKNS1_3PIDIT_EEMSM_FvT0_EOT1_EUlOSI_S3_E_JSI_NSt3__112placeholders4__phILi1EEEclEOS3_
> 15:55:17 @0x10ebf464f 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> 15:55:17 @0x10ebf44c4 process::ProcessBase::consume()
> 15:55:17 @0x10ec6f4d9 
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> 15:55:17 @0x10b0b2389 

[jira] [Created] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-02-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8568:


 Summary: Command checks should always call `WAIT_NESTED_CONTAINER` 
before `REMOVE_NESTED_CONTAINER`
 Key: MESOS-8568
 URL: https://issues.apache.org/jira/browse/MESOS-8568
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik


After a successful launch of a nested container via 
`LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls 
[waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
 for the container. The checker library 
[calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
 `REMOVE_NESTED_CONTAINER` to remove the previous nested container before 
launching a nested container for a subsequent check. Hence, the 
`REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
the nested container has terminated and can be removed/cleaned up.

In case of a failure, the library [doesn't 
call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
 `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
launched, and the subsequent attempt to remove the container without calling 
`WAIT_NESTED_CONTAINER` leads to errors like:
{code:java}
W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal Server 
Error' (Nested container has not terminated yet) while removing the nested 
container 
'2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
 used for the COMMAND check for task 
'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
{code}

The checker library should always call `WAIT_NESTED_CONTAINER` before 
`REMOVE_NESTED_CONTAINER`.
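
A hedged sketch of the intended ordering inside the checker library (the helper 
names and types here are illustrative: `waitNestedContainer` is the call 
referenced above, `removeNestedContainer` stands for the `REMOVE_NESTED_CONTAINER` 
call, and the exact future types are assumptions):
{code:java}
// Always wait for the previous check container to terminate before asking
// the agent to remove it, even if its launch attempt was treated as failed.
waitNestedContainer(previousContainerId)
  .onAny(defer(self(), [=](const Future<Option<int>>&) {
    // Only now is it safe to issue REMOVE_NESTED_CONTAINER.
    removeNestedContainer(previousContainerId);
  }));
{code}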



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`

2018-02-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8550:


Assignee: Benno Evers

> Bug in `Master::detected()` leads to coredump in 
> `MasterZooKeeperTest.MasterInfoAddress`
> 
>
> Key: MESOS-8550
> URL: https://issues.apache.org/jira/browse/MESOS-8550
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master
>Reporter: Andrei Budnik
>Assignee: Benno Evers
>Priority: Major
> Attachments: MasterZooKeeperTest.MasterInfoAddress-badrun.txt
>
>
> {code:java}
> 15:55:17 Assertion failed: (isSome()), function get, file 
> ../../3rdparty/stout/include/stout/option.hpp, line 119.
> 15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if 
> you are using GNU date ***
> 15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill
> 15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID 
> 0x70427000) stack trace: ***
> 15:55:17 @ 0x7fff4fa24f5a _sigtramp
> 15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session 
> expired
> 15:55:17 @ 0x70425500 (unknown)
> 15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: 
> Client 
> environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU
> 15:55:17 @ 0x7fff4f84f312 abort
> 15:55:17 2018-02-07 
> 07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating 
> client connection, host=127.0.0.1:52197 sessionTimeout=1 
> watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 
> flags=0
> 15:55:17 @ 0x7fff4f817368 __assert_rtn
> 15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv
> 15:55:17 @0x10bbb04b5 Option<>::operator->()
> 15:55:17 @0x10bd4514a mesos::internal::master::Master::detected()
> 15:55:17 @0x10bf54558 
> _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_
> 15:55:17 @0x10bf54310 
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_
> 15:55:17 @0x10bf542bb 
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_
> 15:55:17 @0x10bf541f3 
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_
> 15:55:17 @0x10bf540bd 
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS4_6FutureI6OptionINS6_10MasterInfoSG_EEvRKNS4_3PIDIT_EEMSI_FvT0_EOT1_EUlOSE_PNS4_11ProcessBaseEE_JSE_NSt3__112placeholders4__phILi1EEJST_EEEDTclclsr3stdE7forwardISI_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSI_DpOS10_
> 15:55:17 @0x10bf54081 
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS5_6FutureI6OptionINS7_10MasterInfoSH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_PNS5_11ProcessBaseEE_JSF_NSt3__112placeholders4__phILi1EEJSU_EEEvOSJ_DpOT0_
> 15:55:17 @0x10bf53e06 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINSA_10MasterInfoSK_EEvRKNS1_3PIDIT_EEMSM_FvT0_EOT1_EUlOSI_S3_E_JSI_NSt3__112placeholders4__phILi1EEEclEOS3_
> 15:55:17 @0x10ebf464f 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> 15:55:17 @0x10ebf44c4 process::ProcessBase::consume()
> 15:55:17 @0x10ec6f4d9 
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> 15:55:17 @0x10b0b2389 process::ProcessBase::serve()
> 15:55:17 @0x10ebe 

[jira] [Issue Comment Deleted] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`

2018-02-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-8550:
-
Comment: was deleted

(was: [https://reviews.apache.org/r/65571/])

> Bug in `Master::detected()` leads to coredump in 
> `MasterZooKeeperTest.MasterInfoAddress`
> 
>
> Key: MESOS-8550
> URL: https://issues.apache.org/jira/browse/MESOS-8550
> Project: Mesos
>  Issue Type: Bug
>  Components: leader election, master
>Reporter: Andrei Budnik
>Assignee: Benno Evers
>Priority: Major
> Attachments: MasterZooKeeperTest.MasterInfoAddress-badrun.txt
>
>
> {code:java}
> 15:55:17 Assertion failed: (isSome()), function get, file 
> ../../3rdparty/stout/include/stout/option.hpp, line 119.
> 15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if 
> you are using GNU date ***
> 15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill
> 15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID 
> 0x70427000) stack trace: ***
> 15:55:17 @ 0x7fff4fa24f5a _sigtramp
> 15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session 
> expired
> 15:55:17 @ 0x70425500 (unknown)
> 15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: 
> Client 
> environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU
> 15:55:17 @ 0x7fff4f84f312 abort
> 15:55:17 2018-02-07 
> 07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating 
> client connection, host=127.0.0.1:52197 sessionTimeout=1 
> watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 
> flags=0
> 15:55:17 @ 0x7fff4f817368 __assert_rtn
> 15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv
> 15:55:17 @0x10bbb04b5 Option<>::operator->()
> 15:55:17 @0x10bd4514a mesos::internal::master::Master::detected()
> 15:55:17 @0x10bf54558 
> _ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_
> 15:55:17 @0x10bf54310 
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_
> 15:55:17 @0x10bf542bb 
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_
> 15:55:17 @0x10bf541f3 
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_
> 15:55:17 @0x10bf540bd 
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS4_6FutureI6OptionINS6_10MasterInfoSG_EEvRKNS4_3PIDIT_EEMSI_FvT0_EOT1_EUlOSE_PNS4_11ProcessBaseEE_JSE_NSt3__112placeholders4__phILi1EEJST_EEEDTclclsr3stdE7forwardISI_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSI_DpOS10_
> 15:55:17 @0x10bf54081 
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS5_6FutureI6OptionINS7_10MasterInfoSH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_PNS5_11ProcessBaseEE_JSF_NSt3__112placeholders4__phILi1EEJSU_EEEvOSJ_DpOT0_
> 15:55:17 @0x10bf53e06 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINSA_10MasterInfoSK_EEvRKNS1_3PIDIT_EEMSM_FvT0_EOT1_EUlOSI_S3_E_JSI_NSt3__112placeholders4__phILi1EEEclEOS3_
> 15:55:17 @0x10ebf464f 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
> 15:55:17 @0x10ebf44c4 process::ProcessBase::consume()
> 15:55:17 @0x10ec6f4d9 
> _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
> 15:55:17 @0x10b0b2389 process::ProcessBase::serve()
> 

[jira] [Created] (MESOS-8553) Implement a test to reproduce a bug in launch nested container call.

2018-02-08 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8553:


 Summary: Implement a test to reproduce a bug in launch nested 
container call.
 Key: MESOS-8553
 URL: https://issues.apache.org/jira/browse/MESOS-8553
 Project: Mesos
  Issue Type: Task
  Components: test
Reporter: Andrei Budnik


It's known that in some circumstances an attempt to launch a nested container 
session might fail with the following error message:
{code:java}
Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such 
file or directory
{code}
That message is written by the [linux 
launcher|https://github.com/apache/mesos/blob/f7dbd29bd9809d1dd254041537ca875e7ea26613/src/slave/containerizer/mesos/launch.cpp#L742-L743]
 to stdout. This bug is most likely caused by 
[getMountNamespaceTarget()|https://github.com/apache/mesos/blob/f7dbd29bd9809d1dd254041537ca875e7ea26613/src/slave/containerizer/mesos/utils.cpp#L59].

Steps for the test could be:
1) Start a long-running task in its own container (e.g. `sleep 1000`)
2) Start a new short-lived nested container via `LAUNCH_NESTED_CONTAINER` 
(e.g. `echo echo`)
3) Call `WAIT_NESTED_CONTAINER` on that nested container
4) Start a long-lived nested container via `LAUNCH_NESTED_CONTAINER` (e.g. `cat`)
5) Kill that nested container via `KILL_NESTED_CONTAINER`
6) Start another long-lived nested container via 
`LAUNCH_NESTED_CONTAINER_SESSION` (e.g. `cat`)
7) Attach to that container via `ATTACH_CONTAINER_INPUT` and write a non-empty 
message M to the container's stdin
8) Check the output of the nested container: it should contain message M

The bug might pop up during step 8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8552) Failing `CGROUPS_ROOT_PidNamespaceForward` and `CGROUPS_ROOT_PidNamespaceBackward`

2018-02-08 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8552:


 Summary: Failing `CGROUPS_ROOT_PidNamespaceForward` and 
`CGROUPS_ROOT_PidNamespaceBackward`
 Key: MESOS-8552
 URL: https://issues.apache.org/jira/browse/MESOS-8552
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik


{code:java}
W0208 04:41:06.970381 348 containerizer.cpp:2335] Attempted to destroy unknown 
container 001fdfaf-7dab-45b9-ab5c-baa3527d50fb
../../src/tests/slave_recovery_tests.cpp:5189: Failure
termination.get() is NONE
{code}
{code:java}
W0208 04:41:10.058873 348 containerizer.cpp:2335] Attempted to destroy unknown 
container e51afc11-cffe-4861-ba13-b124116522b0 
../../src/tests/slave_recovery_tests.cpp:5294: Failure
termination.get() is NONE
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8550) Bug in `Master::detected()` leads to coredump in `MasterZooKeeperTest.MasterInfoAddress`

2018-02-07 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8550:


 Summary: Bug in `Master::detected()` leads to coredump in 
`MasterZooKeeperTest.MasterInfoAddress`
 Key: MESOS-8550
 URL: https://issues.apache.org/jira/browse/MESOS-8550
 Project: Mesos
  Issue Type: Bug
  Components: leader election, master
Reporter: Andrei Budnik


{code:java}
15:55:17 Assertion failed: (isSome()), function get, file 
../../3rdparty/stout/include/stout/option.hpp, line 119.
15:55:17 *** Aborted at 1518018924 (unix time) try "date -d @1518018924" if you 
are using GNU date ***
15:55:17 PC: @ 0x7fff4f8f2e3e __pthread_kill
15:55:17 *** SIGABRT (@0x7fff4f8f2e3e) received by PID 39896 (TID 
0x70427000) stack trace: ***
15:55:17 @ 0x7fff4fa24f5a _sigtramp
15:55:17 I0207 07:55:24.945252 4890624 group.cpp:511] ZooKeeper session expired
15:55:17 @ 0x70425500 (unknown)
15:55:17 2018-02-07 07:55:24,945:39896(0x70633000):ZOO_INFO@log_env@794: 
Client 
environment:user.dir=/private/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/1mHCvU
15:55:17 @ 0x7fff4f84f312 abort
15:55:17 2018-02-07 
07:55:24,945:39896(0x70633000):ZOO_INFO@zookeeper_init@827: Initiating 
client connection, host=127.0.0.1:52197 sessionTimeout=1 
watcher=0x10d916590 sessionId=0 sessionPasswd= context=0x7fe1bda706a0 
flags=0
15:55:17 @ 0x7fff4f817368 __assert_rtn
15:55:17 @0x10b9cff97 _ZNR6OptionIN5mesos10MasterInfoEE3getEv
15:55:17 @0x10bbb04b5 Option<>::operator->()
15:55:17 @0x10bd4514a mesos::internal::master::Master::detected()
15:55:17 @0x10bf54558 
_ZZN7process8dispatchIN5mesos8internal6master6MasterERKNS_6FutureI6OptionINS1_10MasterInfoSB_EEvRKNS_3PIDIT_EEMSD_FvT0_EOT1_ENKUlOS9_PNS_11ProcessBaseEE_clESM_SO_
15:55:17 @0x10bf54310 
_ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINS3_10MasterInfoSD_EEvRKNS1_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS1_11ProcessBaseEE_JSB_SQ_EEEDTclclsr3stdE7forwardISF_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSF_DpOSS_
15:55:17 @0x10bf542bb 
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1E13invoke_expandISS_NST_5tupleIJSC_SW_EEENSZ_IJOSR_EEEJLm0ELm1DTclsr5cpp17E6invokeclsr3stdE7forwardISG_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardISK_Efp0_EEclsr3stdE7forwardISN_Efp2_OSG_OSK_N5cpp1416integer_sequenceImJXspT2_SO_
15:55:17 @0x10bf541f3 
_ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS2_6FutureI6OptionINS4_10MasterInfoSE_EEvRKNS2_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS2_11ProcessBaseEE_JSC_NSt3__112placeholders4__phILi1EclIJSR_EEEDTcl13invoke_expandclL_ZNST_4moveIRSS_EEONST_16remove_referenceISG_E4typeEOSG_EdtdefpT1fEclL_ZNSZ_IRNST_5tupleIJSC_SW_ES14_S15_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOS1C_
15:55:17 @0x10bf540bd 
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS4_6FutureI6OptionINS6_10MasterInfoSG_EEvRKNS4_3PIDIT_EEMSI_FvT0_EOT1_EUlOSE_PNS4_11ProcessBaseEE_JSE_NSt3__112placeholders4__phILi1EEJST_EEEDTclclsr3stdE7forwardISI_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSI_DpOS10_
15:55:17 @0x10bf54081 
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterERKNS5_6FutureI6OptionINS7_10MasterInfoSH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_PNS5_11ProcessBaseEE_JSF_NSt3__112placeholders4__phILi1EEJSU_EEEvOSJ_DpOT0_
15:55:17 @0x10bf53e06 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterERKNS1_6FutureI6OptionINSA_10MasterInfoSK_EEvRKNS1_3PIDIT_EEMSM_FvT0_EOT1_EUlOSI_S3_E_JSI_NSt3__112placeholders4__phILi1EEEclEOS3_
15:55:17 @0x10ebf464f 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
15:55:17 @0x10ebf44c4 process::ProcessBase::consume()
15:55:17 @0x10ec6f4d9 
_ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
15:55:17 @0x10b0b2389 process::ProcessBase::serve()
15:55:17 @0x10ebe process::ProcessManager::resume()
15:55:17 @0x10ecbd335 
process::ProcessManager::init_threads()::$_2::operator()()
15:55:17 @0x10ecbcee6 
_ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_ZN7process14ProcessManager12init_threadsEvE3$_2EPvSB_
15:55:17 @ 0x7fff4fa2e6c1 _pthread_body
15:55:17 @ 0x7fff4fa2e56d _pthread_start
15:55:17 @ 0x7fff4fa2dc5d thread_start
{code}
This 

[jira] [Created] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

2018-02-05 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8545:


 Summary: AgentAPIStreamingTest.AttachInputToNestedContainerSession 
is flaky.
 Key: MESOS-8545
 URL: https://issues.apache.org/jira/browse/MESOS-8545
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.5.0
Reporter: Andrei Budnik
Assignee: Andrei Budnik


{code:java}
I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server 
Error' for '/slave(974)/api/v1' (Disconnected)
/home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596:
 Failure
Value of: (response).get().status
Actual: "500 Internal Server Error"
Expected: http::OK().status
Which is: "200 OK"
Body: "Disconnected"
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-02-05 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16352682#comment-16352682
 ] 

Andrei Budnik commented on MESOS-8485:
--

[~bennoe] I was able to reproduce lots of bugs using the `::sleep()` function by 
adding it in several places, e.g.:
1) 
https://issues.apache.org/jira/browse/MESOS-7504?focusedCommentId=16192797=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16192797
2) 
https://issues.apache.org/jira/browse/MESOS-7506?focusedCommentId=16243729=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16243729

If it's possible to stably reproduce the issue, then your fix can be easily 
verified.

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed this while testing Mesos 1.5.0-rc1 in ASF CI.
>  
> {code}
> 3: [ RUN      ] MasterTest.RegistryGcByCount
> ..snip...
> 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master
> 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master 
> master@172.17.0.2:45634
> 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 
> authenticatee
> 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client 
> SASL connection
> 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating 
> slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication 
> session for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL 
> connection
> 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL 
> authentication mechanisms: CRAM-MD5
> 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to 
> authenticate with mechanism 'CRAM-MD5'
> 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL 
> authentication start
> 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires 
> more steps
> 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL 
> authentication step
> 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL 
> authentication step
> 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: false 
> 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*userPassword'
> 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*cmusaslsecretCRAM-MD5'
> 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: true 
> 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success
> 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success
> 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated 
> principal 'test-principal' at slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session 
> cleanup for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated 
> with master master@172.17.0.2:45634
> 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in 
> 2.234083ms if necessary
> 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent 
> message from slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with 
> principal 'test-principal'
> 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of 
> agent at slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at 
> slave(442)@172.17.0.2:45634 (455912973e2c) with id 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0
> 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in 
> 227911ns; attempting to update the registry
> 3: I0123 19:22:05.937814 15989 

[jira] [Commented] (MESOS-8485) MasterTest.RegistryGcByCount is flaky

2018-02-02 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350451#comment-16350451
 ] 

Andrei Budnik commented on MESOS-8485:
--

[~bennoe] Can it be stably reproduced by adding several sleeps into the 
tests/code?

> MasterTest.RegistryGcByCount is flaky
> -
>
> Key: MESOS-8485
> URL: https://issues.apache.org/jira/browse/MESOS-8485
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test
>
> Observed this while testing Mesos 1.5.0-rc1 in ASF CI.
>  
> {code}
> 3: [ RUN      ] MasterTest.RegistryGcByCount
> ..snip...
> 3: I0123 19:22:05.929347 15994 slave.cpp:1201] Detecting new master
> 3: I0123 19:22:05.931701 15988 slave.cpp:1228] Authenticating with master 
> master@172.17.0.2:45634
> 3: I0123 19:22:05.931838 15988 slave.cpp:1237] Using default CRAM-MD5 
> authenticatee
> 3: I0123 19:22:05.932153 15999 authenticatee.cpp:121] Creating new client 
> SASL connection
> 3: I0123 19:22:05.932580 15992 master.cpp:8958] Authenticating 
> slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.932822 15990 authenticator.cpp:414] Starting authentication 
> session for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.933163 15989 authenticator.cpp:98] Creating new server SASL 
> connection
> 3: I0123 19:22:05.933465 16001 authenticatee.cpp:213] Received SASL 
> authentication mechanisms: CRAM-MD5
> 3: I0123 19:22:05.933495 16001 authenticatee.cpp:239] Attempting to 
> authenticate with mechanism 'CRAM-MD5'
> 3: I0123 19:22:05.933631 15987 authenticator.cpp:204] Received SASL 
> authentication start
> 3: I0123 19:22:05.933712 15987 authenticator.cpp:326] Authentication requires 
> more steps
> 3: I0123 19:22:05.933851 15987 authenticatee.cpp:259] Received SASL 
> authentication step
> 3: I0123 19:22:05.934006 15987 authenticator.cpp:232] Received SASL 
> authentication step
> 3: I0123 19:22:05.934041 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: false 
> 3: I0123 19:22:05.934095 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*userPassword'
> 3: I0123 19:22:05.934147 15987 auxprop.cpp:181] Looking up auxiliary property 
> '*cmusaslsecretCRAM-MD5'
> 3: I0123 19:22:05.934279 15987 auxprop.cpp:109] Request to lookup properties 
> for user: 'test-principal' realm: '455912973e2c' server FQDN: '455912973e2c' 
> SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false 
> SASL_AUXPROP_AUTHZID: true 
> 3: I0123 19:22:05.934298 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*userPassword' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934307 15987 auxprop.cpp:131] Skipping auxiliary property 
> '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
> 3: I0123 19:22:05.934324 15987 authenticator.cpp:318] Authentication success
> 3: I0123 19:22:05.934463 15995 authenticatee.cpp:299] Authentication success
> 3: I0123 19:22:05.934563 16002 master.cpp:8988] Successfully authenticated 
> principal 'test-principal' at slave(442)@172.17.0.2:45634
> 3: I0123 19:22:05.934708 15993 authenticator.cpp:432] Authentication session 
> cleanup for crammd5-authenticatee(870)@172.17.0.2:45634
> 3: I0123 19:22:05.934891 15995 slave.cpp:1320] Successfully authenticated 
> with master master@172.17.0.2:45634
> 3: I0123 19:22:05.935261 15995 slave.cpp:1764] Will retry registration in 
> 2.234083ms if necessary
> 3: I0123 19:22:05.935436 15999 master.cpp:6061] Received register agent 
> message from slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.935662 15999 master.cpp:3867] Authorizing agent with 
> principal 'test-principal'
> 3: I0123 19:22:05.936161 15992 master.cpp:6123] Authorized registration of 
> agent at slave(442)@172.17.0.2:45634 (455912973e2c)
> 3: I0123 19:22:05.936261 15992 master.cpp:6234] Registering agent at 
> slave(442)@172.17.0.2:45634 (455912973e2c) with id 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0
> 3: I0123 19:22:05.936993 15989 registrar.cpp:495] Applied 1 operations in 
> 227911ns; attempting to update the registry
> 3: I0123 19:22:05.937814 15989 registrar.cpp:552] Successfully updated the 
> registry in 743168ns
> 3: I0123 19:22:05.938057 15991 master.cpp:6282] Admitted agent 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 
> (455912973e2c)
> 3: I0123 19:22:05.938891 15991 master.cpp:6331] Registered agent 
> eef8ea11-9247-44f3-84cf-340b24df3a52-S0 at slave(442)@172.17.0.2:45634 
> (455912973e2c) with cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> 3: I0123 

[jira] [Commented] (MESOS-6616) Error: dereferencing type-punned pointer will break strict-aliasing rules.

2018-02-02 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16350313#comment-16350313
 ] 

Andrei Budnik commented on MESOS-6616:
--

[~alexr] Can these patches be backported to 1.4.x?

> Error: dereferencing type-punned pointer will break strict-aliasing rules.
> --
>
> Key: MESOS-6616
> URL: https://issues.apache.org/jira/browse/MESOS-6616
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.2.3, 1.3.1, 1.4.1
> Environment: Fedora Rawhide;
> Debian 8.10 + gcc 5.5.0-6 with {{O2}}
>Reporter: Orion Poplawski
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: compile-error, mesosphere
> Fix For: 1.5.0
>
>
> Trying to update the mesos package to 1.1.0 in Fedora.  Getting:
> {noformat}
> libtool: compile:  g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" 
> -DPACKAGE_VERSION=\"1.1.0\" "-DPACKAGE_STRING=\"mesos 1.1.0\"" 
> -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" 
> -DVERSION=\"1.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 
> -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 
> -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 
> -DLT_OBJDIR=\".libs/\" -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 
> -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 
> -DHAVE_LIBAPR_1=1 -DHAVE_BOOST_VERSION_HPP=1 -DHAVE_LIBCURL=1 
> -DHAVE_ELFIO_ELFIO_HPP=1 -DHAVE_GLOG_LOGGING_H=1 -DHAVE_HTTP_PARSER_H=1 
> -DMESOS_HAS_JAVA=1 -DHAVE_LEVELDB_DB_H=1 -DHAVE_LIBNL_3=1 
> -DHAVE_LIBNL_ROUTE_3=1 -DHAVE_LIBNL_IDIAG_3=1 -DWITH_NETWORK_ISOLATOR=1 
> -DHAVE_GOOGLE_PROTOBUF_MESSAGE_H=1 -DHAVE_EV_H=1 -DHAVE_PICOJSON_H=1 
> -DHAVE_LIBSASL2=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 
> -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_LIBZ=1 
> -DHAVE_ZOOKEEPER_H=1 -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -I. -Wall 
> -Werror -Wsign-compare -DLIBDIR=\"/usr/lib64\" 
> -DPKGLIBEXECDIR=\"/usr/libexec/mesos\" -DPKGDATADIR=\"/usr/share/mesos\" 
> -DPKGMODULEDIR=\"/usr/lib64/mesos/modules\" -I../include -I../include 
> -I../include/mesos -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS 
> -I../3rdparty/libprocess/include -I../3rdparty/nvml-352.79 
> -I../3rdparty/stout/include -DHAS_AUTHENTICATION=1 -Iyes/include 
> -I/usr/include/subversion-1 -Iyes/include -Iyes/include -Iyes/include/libnl3 
> -Iyes/include -I/ -Iyes/include -I/usr/include/apr-1 -I/usr/include/apr-1.0 
> -I/builddir/build/BUILD/mesos-1.1.0/libev-4.15/include -isystem yes/include 
> -Iyes/include -I/usr/src/gmock -I/usr/src/gmock/include -I/usr/src/gmock/src 
> -I/usr/src/gmock/gtest -I/usr/src/gmock/gtest/include 
> -I/usr/src/gmock/gtest/src -Iyes/include -Iyes/include -I/usr/include 
> -I/builddir/build/BUILD/mesos-1.1.0/libev4.15/include -Iyes/include 
> -I/usr/include -I/usr/include/zookeeper -pthread -O2 -g -pipe -Wall 
> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
> -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches 
> -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic 
> -DEV_CHILD_ENABLE=0 -I/builddir/build/BUILD/mesos-1.1.0/libev-4.15 
> -Wno-unused-local-typedefs -Wno-maybe-uninitialized -std=c++11 -c 
> health-check/health_checker.cpp  -fPIC -DPIC -o 
> health-check/.libs/libmesos_no_3rdparty_la-health_checker.o
> In file included from health-check/health_checker.cpp:51:0:
> ./linux/ns.hpp: In function 'Try ns::clone(pid_t, int, const 
> std::function&, int)':
> ./linux/ns.hpp:480:69: error: dereferencing type-punned pointer will break 
> strict-aliasing rules [-Werror=strict-aliasing]
>  pid_t pid = ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid;
>  ^~
> ./linux/ns.hpp: In lambda function:
> ./linux/ns.hpp:581:59: error: dereferencing type-punned pointer will break 
> strict-aliasing rules [-Werror=strict-aliasing]
>((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid = ::getpid();
>^~
> ./linux/ns.hpp:582:59: error: dereferencing type-punned pointer will break 
> strict-aliasing rules [-Werror=strict-aliasing]
>((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->uid = ::getuid();
>^~
> ./linux/ns.hpp:583:59: error: dereferencing type-punned pointer will break 
> strict-aliasing rules [-Werror=strict-aliasing]
>((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->gid = ::getgid();
>^~
> cc1plus: all warnings being treated as errors
> make[2]: *** [Makefile:6655: 
> health-check/libmesos_no_3rdparty_la-health_checker.lo] 

[jira] [Commented] (MESOS-8521) IOSwitchboardTest::ContainerAttach fails on macOS.

2018-02-01 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348731#comment-16348731
 ] 

Andrei Budnik commented on MESOS-8521:
--

 
{code:java}
IOSwitchboardTest.ContainerAttach
IOSwitchboardTest.ContainerAttachAfterSlaveRestart
IOSwitchboardTest.OutputRedirectionWithTTY
ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1{code}
All of these tests ^^ pass the --tty="true" flag to the mesos-io-switchboard 
process.

 

> IOSwitchboardTest::ContainerAttach fails on macOS. 
> ---
>
> Key: MESOS-8521
> URL: https://issues.apache.org/jira/browse/MESOS-8521
> Project: Mesos
>  Issue Type: Bug
> Environment: macOS 10.13.2 (17C88)
> Apple LLVM version 9.0.0 (clang-900.0.39.2)
>Reporter: Till Toenshoff
>Assignee: Andrei Budnik
>Priority: Major
>
> The problem appears to cause several switchboard tests to fail. Note that 
> this problem does not manifest on older Apple systems.
> The failure rate on this system is 100%.
> This is an example using {{GLOG=v1}} verbose logging:
> {noformat}
> [ RUN  ] IOSwitchboardTest.ContainerAttach
> I0201 03:02:51.925930 2385417024 containerizer.cpp:304] Using isolation { 
> environment_secret, filesystem/posix, posix/cpu }
> I0201 03:02:51.926230 2385417024 provisioner.cpp:299] Using default backend 
> 'copy'
> I0201 03:02:51.927325 107409408 containerizer.cpp:674] Recovering 
> containerizer
> I0201 03:02:51.928336 109019136 provisioner.cpp:495] Provisioner recovery 
> complete
> I0201 03:02:51.934250 105799680 containerizer.cpp:1202] Starting container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.936218 105799680 containerizer.cpp:1368] Checkpointed 
> ContainerConfig at 
> '/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad/config'
> I0201 03:02:51.936251 105799680 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PROVISIONING to 
> PREPARING
> I0201 03:02:51.937369 109019136 switchboard.cpp:429] Allocated pseudo 
> terminal '/dev/ttys003' for container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.943632 109019136 switchboard.cpp:557] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8"
>  --stderr_from_fd="7" --stderr_to_fd="2" --stdin_to_fd="7" 
> --stdout_from_fd="7" --stdout_to_fd="1" --tty="true" 
> --wait_for_connection="false"' for container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.945106 109019136 switchboard.cpp:587] Created I/O switchboard 
> server (pid: 83716) listening on socket file 
> '/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8' for 
> container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.947762 106336256 containerizer.cpp:1844] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"sleep 
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}]},"task_environment":{},"tty_slave_path":"\/dev\/ttys003","working_directory":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}"
>  --pipe_read="7" --pipe_write="10" 
> --runtime_directory="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad"'
> I0201 03:02:51.949144 106336256 launcher.cpp:140] Forked child with pid 
> '83717' for container '1b1af888-9e39-4c13-a647-ac43c0df9fad'
> I0201 03:02:51.949896 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PREPARING to 
> ISOLATING
> I0201 03:02:51.951071 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from ISOLATING to 
> FETCHING
> I0201 03:02:51.951190 108482560 fetcher.cpp:369] Starting to fetch URIs for 
> container: 1b1af888-9e39-4c13-a647-ac43c0df9fad, directory: 
> /var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_W9gDw0
> I0201 03:02:51.951791 109019136 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from FETCHING to 
> RUNNING
> I0201 03:02:52.076602 106872832 containerizer.cpp:2338] Destroying container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad in RUNNING state
> I0201 03:02:52.076644 106872832 containerizer.cpp:2952] Transitioning the 
> state of container 

[jira] [Assigned] (MESOS-8521) IOSwitchboardTest::ContainerAttach fails on macOS.

2018-02-01 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8521:


Assignee: Andrei Budnik

> IOSwitchboardTest::ContainerAttach fails on macOS. 
> ---
>
> Key: MESOS-8521
> URL: https://issues.apache.org/jira/browse/MESOS-8521
> Project: Mesos
>  Issue Type: Bug
> Environment: macOS 10.13.2 (17C88)
> Apple LLVM version 9.0.0 (clang-900.0.39.2)
>Reporter: Till Toenshoff
>Assignee: Andrei Budnik
>Priority: Major
>
> The problem appears to cause several switchboard tests to fail. Note that 
> this problem does not manifest on older Apple systems.
> The failure rate on this system is 100%.
> This is an example using {{GLOG=v1}} verbose logging:
> {noformat}
> [ RUN  ] IOSwitchboardTest.ContainerAttach
> I0201 03:02:51.925930 2385417024 containerizer.cpp:304] Using isolation { 
> environment_secret, filesystem/posix, posix/cpu }
> I0201 03:02:51.926230 2385417024 provisioner.cpp:299] Using default backend 
> 'copy'
> I0201 03:02:51.927325 107409408 containerizer.cpp:674] Recovering 
> containerizer
> I0201 03:02:51.928336 109019136 provisioner.cpp:495] Provisioner recovery 
> complete
> I0201 03:02:51.934250 105799680 containerizer.cpp:1202] Starting container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.936218 105799680 containerizer.cpp:1368] Checkpointed 
> ContainerConfig at 
> '/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad/config'
> I0201 03:02:51.936251 105799680 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PROVISIONING to 
> PREPARING
> I0201 03:02:51.937369 109019136 switchboard.cpp:429] Allocated pseudo 
> terminal '/dev/ttys003' for container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.943632 109019136 switchboard.cpp:557] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8"
>  --stderr_from_fd="7" --stderr_to_fd="2" --stdin_to_fd="7" 
> --stdout_from_fd="7" --stdout_to_fd="1" --tty="true" 
> --wait_for_connection="false"' for container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.945106 109019136 switchboard.cpp:587] Created I/O switchboard 
> server (pid: 83716) listening on socket file 
> '/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8' for 
> container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.947762 106336256 containerizer.cpp:1844] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"sleep 
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}]},"task_environment":{},"tty_slave_path":"\/dev\/ttys003","working_directory":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}"
>  --pipe_read="7" --pipe_write="10" 
> --runtime_directory="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad"'
> I0201 03:02:51.949144 106336256 launcher.cpp:140] Forked child with pid 
> '83717' for container '1b1af888-9e39-4c13-a647-ac43c0df9fad'
> I0201 03:02:51.949896 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PREPARING to 
> ISOLATING
> I0201 03:02:51.951071 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from ISOLATING to 
> FETCHING
> I0201 03:02:51.951190 108482560 fetcher.cpp:369] Starting to fetch URIs for 
> container: 1b1af888-9e39-4c13-a647-ac43c0df9fad, directory: 
> /var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_W9gDw0
> I0201 03:02:51.951791 109019136 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from FETCHING to 
> RUNNING
> I0201 03:02:52.076602 106872832 containerizer.cpp:2338] Destroying container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad in RUNNING state
> I0201 03:02:52.076644 106872832 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from RUNNING to 
> DESTROYING
> I0201 03:02:52.076920 106872832 launcher.cpp:156] Asked to destroy container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:52.158571 107945984 containerizer.cpp:2791] Container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad has exited
> I0201 03:02:57.162788 110092288 switchboard.cpp:790] Sending SIGTERM to I/O 
> switchboard 

[jira] [Commented] (MESOS-8513) Noisy "transport endpoint is not connected" logs on closing sockets.

2018-01-31 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347243#comment-16347243
 ] 

Andrei Budnik commented on MESOS-8513:
--

This log message gave us a good hint while debugging MESOS-8247.
However, now that [https://reviews.apache.org/r/64032] has landed, it seems the 
verbosity of the "Failed to shutdown socket" message can safely be decreased 
to VLOG.
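
A minimal sketch of what such a change could look like (hypothetical, not the actual libprocess code and not the r/64032 patch; the function name is made up for illustration):
{code}
// Hypothetical sketch: log the failed shutdown at VLOG(1) instead of
// LOG(ERROR), since ENOTCONN here is a common, non-actionable outcome.
#include <cerrno>
#include <cstring>
#include <sys/socket.h>

#include <glog/logging.h>

void shutdownSocket(int fd)
{
  if (::shutdown(fd, SHUT_RDWR) < 0) {
    VLOG(1) << "Failed to shutdown socket with fd " << fd
            << ": " << std::strerror(errno);
  }
}
{code}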

> Noisy "transport endpoint is not connected" logs on closing sockets.
> 
>
> Key: MESOS-8513
> URL: https://issues.apache.org/jira/browse/MESOS-8513
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.1, 1.5.0, 1.6.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>Priority: Minor
>  Labels: libprocess, logging, socket
>
> When within libprocess a socket is closing, we try to shut it down. That 
> shutdown fails as the socket is not connected. This is intended behavior. The 
> error code returned {{ENOTCONN}} tells us that there is nothing to see here 
> for such common scenario.
> The problem appears to be the logging of such event - that might appear as 
> not useful - no matter which log-level is used.
> {noformat}
> E1214 08:15:18.017247 20752 process.cpp:2401] Failed to shutdown socket with 
> fd 288: Transport endpoint is not connected
> {noformat}
> We should try to prevent this specific, non actionable logging entirely while 
> making sure we do not hinder debugging scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky

2018-01-29 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-8489:
-
Shepherd: Gilbert Song

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --
>
> Key: MESOS-8489
> URL: https://issues.apache.org/jira/browse/MESOS-8489
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: ROOT_IsolatorFlags-badrun3.txt
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
>  # Add {{::sleep(1);}} before 
> [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
>  "test" cgroup
>  # recompile
>  # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
> --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
> --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first 
> slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
>  calling 
> [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
>  leads to calling 
> [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
>  to create a containerizer. An attempt to create a mesos c'zer, leads to 
> calling 
> [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
>  Finally, we get to the point, where we try to create a ["test" 
> container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
>  So, the recovery process for the first slave [might 
> detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
>  this "test" container as an orphaned container.
> Thus, there is the race between recovery process for the first slave and an 
> attempt to create a c'zer for the second agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky

2018-01-25 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8489:


Assignee: Andrei Budnik

> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky
> --
>
> Key: MESOS-8489
> URL: https://issues.apache.org/jira/browse/MESOS-8489
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: ROOT_IsolatorFlags-badrun3.txt
>
>
> Observed this on internal Mesosphere CI.
> {code:java}
> ../../src/tests/cluster.cpp:662: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { test }
> {code}
> h2. Steps to reproduce
>  # Add {{::sleep(1);}} before 
> [removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
>  "test" cgroup
>  # recompile
>  # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
> --gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
> --gtest_break_on_failure --gtest_repeat=10 --verbose`
> h2. Race description
> While recovery is in progress for [the first 
> slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
>  calling 
> [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
>  leads to calling 
> [`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
>  to create a containerizer. An attempt to create a mesos c'zer, leads to 
> calling 
> [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
>  Finally, we get to the point, where we try to create a ["test" 
> container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
>  So, the recovery process for the first slave [might 
> detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
>  this "test" container as an orphaned container.
> Thus, there is the race between recovery process for the first slave and an 
> attempt to create a c'zer for the second agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-25 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339151#comment-16339151
 ] 

Andrei Budnik commented on MESOS-7506:
--

Created a separate MESOS-8489 ticket for this ^^

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8489) LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is flaky

2018-01-25 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8489:


 Summary: LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags is 
flaky
 Key: MESOS-8489
 URL: https://issues.apache.org/jira/browse/MESOS-8489
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Andrei Budnik
 Attachments: ROOT_IsolatorFlags-badrun3.txt

Observed this on internal Mesosphere CI.
{code:java}
../../src/tests/cluster.cpp:662: Failure
Value of: containers->empty()
  Actual: false
Expected: true
Failed to destroy containers: { test }
{code}
h2. Steps to reproduce
 # Add {{::sleep(1);}} before 
[removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
 the "test" cgroup (see the sketch after these steps)
 # recompile
 # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
--gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
--gtest_break_on_failure --gtest_repeat=10 --verbose`
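
For illustration, the injected delay from step 1 sits roughly here (paraphrased around the linked cgroups.cpp region, not an exact excerpt of it):
{code}
// Paraphrased sketch, not the exact cgroups.cpp code: the hierarchy
// verification creates and then removes a short-lived "test" cgroup.
// The injected sleep keeps the "test" cgroup alive long enough for a
// concurrently recovering agent to observe it as an orphaned container.
Try<Nothing> create = cgroups::create(hierarchy, "test");
// ... existing error handling ...

::sleep(1);  // <-- injected delay for reproducing the race

Try<Nothing> remove = cgroups::remove(hierarchy, "test");
// ... existing error handling ...
{code}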

h2. Race description

While recovery is in progress for [the first 
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
 calling 
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
 leads to calling 
[`slave::Containerizer::create()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
 to create a containerizer. An attempt to create a Mesos c'zer leads to 
calling 
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
 Finally, we reach the point where we try to create a ["test" 
container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
 So, the recovery process for the first slave [might 
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
 this "test" container as an orphaned container.

Thus, there is a race between the recovery process for the first slave and the 
attempt to create a c'zer for the second agent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-23 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336272#comment-16336272
 ] 

Andrei Budnik edited comment on MESOS-7506 at 1/23/18 7:20 PM:
---

While recovery is in progress for [the first 
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
 calling 
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
 leads to calling 
[slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
 to create a containerizer. An attempt to create a Mesos c'zer leads to 
calling 
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
 Finally, we reach the point where we try to create a ["test" 
container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].
 So, the recovery process for the first slave [might 
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
 this "test" container as an orphaned container.

So, there is a race between the recovery process for the first slave and the 
attempt to create a c'zer for the second agent.


was (Author: abudnik):
While recovery is in progress for [the first 
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
 calling 
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
 leads to calling 
[slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
 to create a containerizer. An attempt to create a mesos c'zer, leads to 
calling 
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
 Finally, we get to the point, where we try to create a ["test" 
container|[https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].]
 So, the recovery process for the first slave [might 
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
 this "test" container as an orphaned container.

So, there is the race between recovery process for the first slave and an 
attempt to create a c'zer for the second agent.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-23 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336272#comment-16336272
 ] 

Andrei Budnik commented on MESOS-7506:
--

While recovery is in progress for [the first 
slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733],
 calling 
[`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738]
 leads to calling 
[slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431]
 to create a containerizer. An attempt to create a mesos c'zer, leads to 
calling 
[`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124].
 Finally, we get to the point, where we try to create a ["test" 
container|[https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].]
 So, the recovery process for the first slave [might 
detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301]
 this "test" container as an orphaned container.

So, there is the race between recovery process for the first slave and an 
attempt to create a c'zer for the second agent.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-23 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336134#comment-16336134
 ] 

Andrei Budnik commented on MESOS-7506:
--

Steps to reproduce `LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags`:
 # Add {{::sleep(1);}} before 
[removing|https://github.com/apache/mesos/blob/e91ce42ed56c5ab65220fbba740a8a50c7f835ae/src/linux/cgroups.cpp#L483]
 "test" cgroup
 # recompile
 # run `GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
--gtest_filter=LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags 
--gtest_break_on_failure --gtest_repeat=10 --verbose`

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any 
> more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-22 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334227#comment-16334227
 ] 

Andrei Budnik commented on MESOS-7742:
--

https://reviews.apache.org/r/65261/

I think this patch provides a better solution than retrying to 
[connect|https://github.com/apache/mesos/blob/336e932199643e88c0edbea7c1f08d4b45596389/src/slave/containerizer/mesos/io/switchboard.cpp#L696-L700],
because a retry-based approach would have several drawbacks (a rough sketch of such a retry loop is shown below):
# it needs one more `loop` for the retry logic
# it needs a defined limit on retry attempts and a delay between attempts
# it might still retry the connection on some non-ECONNREFUSED error
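
For reference, this is roughly what such a retry loop would have to look like (plain POSIX C++, not the actual libprocess/switchboard code; the function name, attempt limit, and delay are made up for illustration):
{code}
// Hypothetical sketch only: reconnect to the switchboard's unix socket while
// the error is ECONNREFUSED, with an arbitrary attempt limit and delay.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <cstring>
#include <string>
#include <thread>

int connectWithRetries(const std::string& path)
{
  for (int attempt = 0; attempt < 10; ++attempt) {
    int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
      return -1;
    }

    sockaddr_un addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);

    if (::connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
      return fd;  // Connected; the caller owns the fd now.
    }

    const int error = errno;
    ::close(fd);

    if (error != ECONNREFUSED) {
      return -1;  // Retrying only makes sense for ECONNREFUSED.
    }

    // Arbitrary delay before the next attempt.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  return -1;  // Gave up after the attempt limit.
}
{code}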

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-18 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16330630#comment-16330630
 ] 

Andrei Budnik commented on MESOS-7742:
--

Steps to reproduce the second cause:
1. Add a {{::sleep(2);}} after [binding the unix 
socket|https://github.com/apache/mesos/blob/634c8af2618c57a1405d20717fa909b399486f37/src/slave/containerizer/mesos/io/switchboard.cpp#L1056].
2. Recompile with `make && make check`.
3. Launch the test:
{code}
GLOG_v=2 sudo GLOG_v=2 ./src/mesos-tests 
--gtest_filter=ContentType/AgentAPITest.LaunchNestedContainerSession/0 
--gtest_break_on_failure --gtest_repeat=1 --verbose
{code}


> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329006#comment-16329006
 ] 

Andrei Budnik commented on MESOS-7742:
--

These patches ^^ fix the first cause described in the [first 
patch|https://reviews.apache.org/r/65122/].

There is a second cause, where an attempt to connect to the IO switchboard fails with:
{code:java}
I1109 23:47:25.016929 27803 process.cpp:3982] Failed to process request for 
'/slave(812)/api/v1': Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused
W1109 23:47:25.017009 27803 http.cpp:2944] Failed to attach to nested container 
7ab572dd-78b5-4186-93af-7ac011990f80.b77944da-f1d5-4694-a51b-8fde150c5f7a: 
Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused
I1109 23:47:25.017063 27803 process.cpp:1590] Returning '500 Internal Server 
Error' for '/slave(812)/api/v1' (Failed to connect to 
/tmp/mesos-io-switchboard-56bcba4b-6e81-4aeb-a0e9-41309ec991b5: Connection 
refused)
{code}
The reason for this failure needs to be investigated.

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky-test, mesosphere-oncall
> Fix For: 1.6.0
>
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7028:
-
Sprint: Mesosphere Sprint 72
Remaining Estimate: 5m
 Original Estimate: 5m

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7028:
-
  Story Points: 5
Remaining Estimate: (was: 5m)
 Original Estimate: (was: 5m)

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-16 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7028:


Assignee: Andrei Budnik  (was: Greg Mann)

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-15 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326364#comment-16326364
 ] 

Andrei Budnik commented on MESOS-7028:
--

Steps to reproduce:
1) Add a {{::sleep(1);}} after 
[server_socket.shutdown()|https://github.com/apache/mesos/blob/4959887230a7d7c55629083be978810f48b780a3/3rdparty/libprocess/src/tests/socket_tests.cpp#L195]
 (see the sketch after these steps)
2) recompile with `make check`
3) launch the test:
{code:java}
GLOG_v=3 sudo GLOG_v=3 ./3rdparty/libprocess/libprocess-tests 
--gtest_filter=Encryption/NetSocketTest.EOFBeforeRecv/0 
--gtest_break_on_failure --gtest_repeat=1 --verbose
{code}
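
Roughly, the injection from step 1 sits here (paraphrased around the linked socket_tests.cpp line, not an exact excerpt; the assertion that follows is only indicated in a comment):
{code}
// Paraphrased sketch, not the exact test code: the injected delay sits
// between the server-side shutdown and the client's subsequent recv(),
// which is the window in which the EOF notification can be missed.
server_socket.shutdown();

::sleep(1);  // <-- injected delay for reproducing the flake

// ... the test then waits up to 15secs on client->recv(), expecting EOF ...
{code}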

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-7028) NetSocketTest.EOFBeforeRecv is flaky

2018-01-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7028:
-
Attachment: NetSocketTest.EOFBeforeRecv-vlog3.txt

> NetSocketTest.EOFBeforeRecv is flaky
> 
>
> Key: MESOS-7028
> URL: https://issues.apache.org/jira/browse/MESOS-7028
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, test
> Environment: ASF CI, autotools, gcc, CentOS 7, libevent/SSL enabled;
> Mac OS with SSL enabled;
> CentOS 6 with SSL enabled;
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: flaky, flaky-test, libprocess, mesosphere, socket, ssl
> Attachments: NetSocketTest.EOFBeforeRecv-vlog3.txt
>
>
> This was observed on ASF CI:
> {code}
> [ RUN  ] Encryption/NetSocketTest.EOFBeforeRecv/0
> I0128 03:48:51.444228 27745 openssl.cpp:419] CA file path is unspecified! 
> NOTE: Set CA file path with LIBPROCESS_SSL_CA_FILE=
> I0128 03:48:51.444252 27745 openssl.cpp:424] CA directory path unspecified! 
> NOTE: Set CA directory path with LIBPROCESS_SSL_CA_DIR=
> I0128 03:48:51.444257 27745 openssl.cpp:429] Will not verify peer certificate!
> NOTE: Set LIBPROCESS_SSL_VERIFY_CERT=1 to enable peer certificate verification
> I0128 03:48:51.444262 27745 openssl.cpp:435] Will only verify peer 
> certificate if presented!
> NOTE: Set LIBPROCESS_SSL_REQUIRE_CERT=1 to require peer certificate 
> verification
> I0128 03:48:51.447341 27745 process.cpp:1246] libprocess is initialized on 
> 172.17.0.2:45515 with 16 worker threads
> ../../../3rdparty/libprocess/src/tests/socket_tests.cpp:196: Failure
> Failed to wait 15secs for client->recv()
> [  FAILED  ] Encryption/NetSocketTest.EOFBeforeRecv/0, where GetParam() = 
> "SSL" (15269 ms)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-12 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318666#comment-16318666
 ] 

Andrei Budnik edited comment on MESOS-7742 at 1/12/18 12:21 PM:


The io switchboard 
[terminates|https://github.com/apache/mesos/blob/3d8ef23c0ecec028641d7beee4c85233495a030b/src/slave/containerizer/mesos/io/switchboard.cpp#L1218]
 itself once the io redirect is finished.
If the io switchboard terminates before it receives {{\r\n\r\n}}, or before the 
agent receives the {{200 OK}} response from the io switchboard, the connection 
to the agent (via unix socket) is closed, so the agent's {{ConnectionProcess}} 
handles this case as an unexpected 
[EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293]
 while 
[reading|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216]
 the response. That leads to a {{500 Internal Server Error}} response from the 
agent for the {{ATTACH_CONTAINER_INPUT}} request.
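
Below is a minimal, self-contained sketch of that failure mode (plain C++ with 
made-up names; not the actual libprocess/agent code): when the peer closes the 
connection before a complete response has been parsed, the pending request is 
treated as a failure, and that failure is what surfaces to the caller as 
{{500 Internal Server Error}}.
{code}
#include <iostream>
#include <optional>
#include <string>

// Illustrative stand-ins; not the real libprocess/agent types.
struct Response { int code; std::string body; };

// Parse a full "HTTP/1.1 200 OK ... \r\n\r\n" response if one is buffered.
std::optional<Response> tryParse(const std::string& buffered)
{
  if (buffered.rfind("HTTP/1.1 200", 0) == 0 &&
      buffered.find("\r\n\r\n") != std::string::npos) {
    return Response{200, "OK"};
  }
  return std::nullopt;  // Incomplete: keep waiting for more bytes.
}

// What effectively happens when the connection dies early: EOF with no
// parsed response becomes a failure, reported to the caller as a 500.
Response onConnectionClosed(const std::string& buffered)
{
  if (auto response = tryParse(buffered)) {
    return *response;                    // The response arrived in time.
  }
  return Response{500, "Disconnected"};  // Unexpected EOF -> 500.
}

int main()
{
  // Switchboard answered before exiting: the caller sees 200.
  std::cout << onConnectionClosed("HTTP/1.1 200 OK\r\n\r\n").code << '\n';

  // Switchboard died before (or while) answering: the caller sees 500.
  std::cout << onConnectionClosed("").code << '\n';
}
{code}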


was (Author: abudnik):
Since we have launched the 
[`cat`|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/tests/api_tests.cpp#L6529]
 command as a nested container, the related io switchboard process is in the 
same process group. Whenever the process group leader ({{cat}}) terminates, all 
processes in the process group are killed, including the io switchboard.
The io switchboard handles HTTP requests from the slave, e.g. the 
{{ATTACH_CONTAINER_INPUT}} request in this test.
Usually, after reading all of the client's data, {{Http::_attachContainerInput()}} 
invokes a callback which calls 
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3223].
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L561]
 implies sending a 
[\r\n\r\n|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1045]
 to the io switchboard process.
The io switchboard returns a [200 
OK|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/containerizer/mesos/io/switchboard.cpp#L1572]
 response, hence the agent returns {{200 OK}} for the {{ATTACH_CONTAINER_INPUT}} 
request, as expected.

However, if the io switchboard terminates before it receives {{\r\n\r\n}}, or 
before the agent receives the {{200 OK}} response from the io switchboard, the 
connection (via unix socket) might be closed, so the corresponding 
{{ConnectionProcess}} will handle this case as an unexpected 
[EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293]
 during a 
[read|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216]
 of the response. That will lead to a {{500 Internal Server Error}} response 
from the agent.
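
The process-group coupling mentioned above can be illustrated with a small 
standalone POSIX program (an illustrative model, not Mesos code): every process 
placed into the same process group receives a signal delivered to that group, 
which is one mechanism by which a helper process like the io switchboard can be 
taken down together with the group.
{code}
#include <csignal>
#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a child, put it into the process group `pgid` (0 = its own group),
// and let it wait for a signal.
static pid_t spawnInGroup(pid_t pgid)
{
  pid_t pid = fork();
  if (pid == 0) {
    setpgid(0, pgid);   // Join the given process group.
    pause();            // Wait until a signal arrives.
    _exit(0);
  }
  setpgid(pid, pgid);   // Also set it from the parent to avoid a race.
  return pid;
}

int main()
{
  // First child becomes the process group leader (pgid == its own pid).
  pid_t leader = spawnInGroup(0);
  // Second child ("the helper") joins the leader's group.
  pid_t helper = spawnInGroup(leader);

  // Signal the whole group: both the leader and the helper receive SIGTERM.
  killpg(leader, SIGTERM);

  int status = 0;
  waitpid(leader, &status, 0);
  waitpid(helper, &status, 0);
  std::printf("leader and helper are both gone\n");
  return 0;
}
{code}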

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: 

[jira] [Updated] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-10 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7742:
-
Sprint: Mesosphere Sprint 58, Mesosphere Sprint 72  (was: Mesosphere Sprint 
58)

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-10 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320409#comment-16320409
 ] 

Andrei Budnik edited comment on MESOS-8391 at 1/10/18 6:47 PM:
---

https://reviews.apache.org/r/65071/
https://reviews.apache.org/r/65077/


was (Author: abudnik):
https://reviews.apache.org/r/65071/

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Assignee: Andrei Budnik
>Priority: Blocker
> Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> `TASK_KILLING` state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8391) Mesos agent doesn't notice that a pod task exits or crashes after the agent restart

2018-01-10 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8391:


Assignee: Andrei Budnik  (was: Gilbert Song)

> Mesos agent doesn't notice that a pod task exits or crashes after the agent 
> restart
> ---
>
> Key: MESOS-8391
> URL: https://issues.apache.org/jira/browse/MESOS-8391
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, executor
>Affects Versions: 1.5.0
>Reporter: Ivan Chernetsky
>Assignee: Andrei Budnik
>Priority: Blocker
> Attachments: testing-log-2.tar.gz
>
>
> h4. (1) Agent doesn't detect that a pod task exits/crashes
> # Create a Marathon pod with two containers which just do {{sleep 1}}.
> # Restart the Mesos agent on the node the pod got launched.
> # Kill one of the pod tasks
> *Expected result*: The Mesos agent detects that one of the tasks got killed, 
> and forwards {{TASK_FAILED}} status to Marathon.
> *Actual result*: The Mesos agent does nothing, and the Mesos master thinks 
> that both tasks are running just fine. Marathon doesn't take any action 
> because it doesn't receive any update from Mesos.
> h4. (2) After the agent restart, it detects that the task crashed, forwards 
> the correct status update, but the other task stays in {{TASK_KILLING}} state 
> forever
> # Perform steps in (1).
> # Restart the Mesos agent
> *Expected result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, and kills the other task too.
> *Actual result*: The Mesos agent detects that one of the tasks crashed, 
> forwards the corresponding status update, but the other task stays in 
> `TASK_KILLING` state forever.
> Please note that after another agent restart, the other task finally gets 
> killed and the correct status updates get propagated all the way to Marathon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-09 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318666#comment-16318666
 ] 

Andrei Budnik commented on MESOS-7742:
--

Since we have launched the 
[`cat`|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/tests/api_tests.cpp#L6529]
 command as a nested container, the related io switchboard process is in the 
same process group. Whenever the process group leader ({{cat}}) terminates, all 
processes in the process group are killed, including the io switchboard.
The io switchboard handles HTTP requests from the slave, e.g. the 
{{ATTACH_CONTAINER_INPUT}} request in this test.
Usually, after reading all of the client's data, {{Http::_attachContainerInput()}} 
invokes a callback which calls 
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3223].
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L561]
 implies sending a 
[\r\n\r\n|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1045]
 to the io switchboard process.
The io switchboard returns a [200 
OK|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/containerizer/mesos/io/switchboard.cpp#L1572]
 response, hence the agent returns {{200 OK}} for the {{ATTACH_CONTAINER_INPUT}} 
request, as expected.

However, if the io switchboard terminates before it receives {{\r\n\r\n}}, or 
before the agent receives the {{200 OK}} response from the io switchboard, the 
connection (via unix socket) might be closed, so the corresponding 
{{ConnectionProcess}} will handle this case as an unexpected 
[EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293]
 during a 
[read|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216]
 of the response. That will lead to a {{500 Internal Server Error}} response 
from the agent.
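
For completeness, a tiny standalone demo of the "unexpected EOF" case 
(illustrative POSIX code, not the {{ConnectionProcess}} implementation): a 
{{read()}} that returns 0 before any response bytes have arrived is exactly the 
condition the client side of the connection has to report as an error.
{code}
#include <cstdio>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    perror("socketpair");
    return 1;
  }

  // "Client" (think: agent) sends a request and expects a response.
  const std::string request = "POST /attach HTTP/1.1\r\n\r\n";
  write(fds[0], request.data(), request.size());

  // "Server" (think: io switchboard) exits before answering: it simply
  // closes its end of the connection without writing anything back.
  close(fds[1]);

  char buffer[256];
  ssize_t n = read(fds[0], buffer, sizeof(buffer));
  if (n == 0) {
    // EOF before any response bytes: this is the case that surfaces as
    // "500 Internal Server Error" / "Disconnected" at the HTTP layer.
    std::printf("unexpected EOF before a response was received\n");
  }

  close(fds[0]);
  return 0;
}
{code}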

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-09 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318567#comment-16318567
 ] 

Andrei Budnik commented on MESOS-7742:
--

How to reproduce Flavour 3:
Put a {{::sleep(1);}} before {{writer.close();}} in 
[Http::_attachContainerInput()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3222].

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-09 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7506:
-
Attachment: ROOT_IsolatorFlags-badrun2.txt

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, 
> ReconcileTasksMissingFromSlave-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky

2018-01-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7742:


Assignee: Andrei Budnik

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> --
>
> Key: MESOS-7742
> URL: https://issues.apache.org/jira/browse/MESOS-7742
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Andrei Budnik
>  Labels: flaky-test, mesosphere-oncall
> Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, 
> LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes at least in three different flavours. Take 
> {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2018-01-04 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7506:
-
Attachment: ReconcileTasksMissingFromSlave-badrun.txt

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ReconcileTasksMissingFromSlave-badrun.txt, 
> ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-5048) MesosContainerizerSlaveRecoveryTest.ResourceStatistics is flaky

2018-01-04 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-5048:
-
Attachment: ResourceStatistics-badrun3.txt

This log contains a line which might help in finding the root cause:

{code}
../../src/tests/mesos.cpp:889: Failure
(cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup 
'/sys/fs/cgroup/memory/mesos_test_e7b0866c-e63a-4a0c-b810-d47d7d059b7c/8f128d76-4d34-4cd3-9dec-154a59d62977':
 Device or resource busy
{code}
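
The {{Device or resource busy}} error usually means the cgroup still has 
attached processes (or live sub-cgroups) at the moment removal is attempted. A 
generic diagnostic is to dump the cgroup's {{cgroup.procs}} file; the sketch 
below does just that (a standalone helper, not part of the test code; the 
default path is only an example taken from the log above).
{code}
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
  // Path of the cgroup that failed to be removed (example from the log).
  std::string cgroup = argc > 1
    ? argv[1]
    : "/sys/fs/cgroup/memory/mesos_test_.../8f128d76-...";

  std::ifstream procs(cgroup + "/cgroup.procs");
  if (!procs) {
    std::cerr << "cannot open " << cgroup << "/cgroup.procs\n";
    return 1;
  }

  // Any PID printed here is a process still attached to the cgroup,
  // which is enough to make rmdir() fail with EBUSY.
  std::string pid;
  while (std::getline(procs, pid)) {
    std::cout << "still in cgroup: " << pid << '\n';
  }
  return 0;
}
{code}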


> MesosContainerizerSlaveRecoveryTest.ResourceStatistics is flaky
> ---
>
> Key: MESOS-5048
> URL: https://issues.apache.org/jira/browse/MESOS-5048
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.28.0
> Environment: Ubuntu 15.04, Ubuntu 16.04
>Reporter: Jian Qiu
>  Labels: flaky-test
> Attachments: ResourceStatistics-badrun2.txt, 
> ResourceStatistics-badrun3.txt
>
>
> ./mesos-tests.sh 
> --gtest_filter=MesosContainerizerSlaveRecoveryTest.ResourceStatistics 
> --gtest_repeat=100 --gtest_break_on_failure
> This is found in rb, and reproduced on my local machine. There are two types 
> of failures. However, the failure does not appear when enabling verbose...
> {code}
> ../../src/tests/environment.cpp:790: Failure
> Failed
> Tests completed with child processes remaining:
> -+- 1446 /mesos/mesos-0.29.0/_build/src/.libs/lt-mesos-tests 
>  \-+- 9171 sh -c /mesos/mesos-0.29.0/_build/src/mesos-executor 
>\--- 9185 /mesos/mesos-0.29.0/_build/src/.libs/lt-mesos-executor 
> {code}
> And
> {code}
> I0328 15:42:36.982471  5687 exec.cpp:150] Version: 0.29.0
> I0328 15:42:37.008765  5708 exec.cpp:225] Executor registered on slave 
> 731fb93b-26fe-4c7c-a543-fc76f106a62e-S0
> Registered executor on mesos
> ../../src/tests/slave_recovery_tests.cpp:3506: Failure
> Value of: containers.get().size()
>   Actual: 0
> Expected: 1u
> Which is: 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8247) Executor registered message is lost

2017-11-21 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261281#comment-16261281
 ] 

Andrei Budnik commented on MESOS-8247:
--

Additional logs:
{code}
Nov 14 23:03:21 ip-xxx mesos-agent[2029]: E1114 23:03:21.049590  2057 
process.cpp:2431] Failed to shutdown socket with fd 320: Transport
 endpoint is not connected
Nov 14 23:03:21 ip-xxx mesos-agent[2029]: I1114 23:03:21.049783  2054 
slave.cpp:4484] Got exited event for executor(1)@xx.xx.yy.zzz:10895
{code}

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> The executor sends a `RegisterExecutorMessage` to the agent during its 
> initialization step. The agent responds to the executor with an 
> `ExecutorRegisteredMessage` in its `registerExecutor()` method. Whenever the 
> executor receives the `ExecutorRegisteredMessage`, it prints `Executor 
> registered on agent...` to its stderr log.
> h3. Problem description.
> The agent launches built-in docker executor, which is stuck in `STAGING` 
> state.
> stderr logs of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time, the agent received the `RegisterExecutorMessage` and sent a 
> `runTask` message to the executor.
> The stdout log consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process doesn't contain child processes.
> Currently, executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered at the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like `ExecutorRegisteredMessage` has been lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8247) Executor registered message is lost

2017-11-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257291#comment-16257291
 ] 

Andrei Budnik commented on MESOS-8247:
--

Related https://issues.apache.org/jira/browse/MESOS-3851 ?

> Executor registered message is lost
> ---
>
> Key: MESOS-8247
> URL: https://issues.apache.org/jira/browse/MESOS-8247
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>
> h3. Brief description of successful agent-executor communication.
> The executor sends a `RegisterExecutorMessage` to the agent during its 
> initialization step. The agent responds to the executor with an 
> `ExecutorRegisteredMessage` in its `registerExecutor()` method. Whenever the 
> executor receives the `ExecutorRegisteredMessage`, it prints `Executor 
> registered on agent...` to its stderr log.
> h3. Problem description.
> The agent launches built-in docker executor, which is stuck in `STAGING` 
> state.
> stderr logs of the docker executor:
> {code}
> I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
> {code}
> It doesn't contain a message like `Executor registered on agent...`. At the 
> same time, the agent received the `RegisterExecutorMessage` and sent a 
> `runTask` message to the executor.
> The stdout log consists of the same repeating message:
> {code}
> Received killTask for task ...
> {code}
> Also, the docker executor process doesn't contain child processes.
> Currently, executor [doesn't 
> attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
>  to launch a task if it is not registered at the agent, while [task 
> killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
>  doesn't have such a check.
> It looks like `ExecutorRegisteredMessage` has been lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8247) Executor registered message is lost

2017-11-17 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8247:


 Summary: Executor registered message is lost
 Key: MESOS-8247
 URL: https://issues.apache.org/jira/browse/MESOS-8247
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik


h3. Brief description of successful agent-executor communication.
The executor sends a `RegisterExecutorMessage` to the agent during its 
initialization step. The agent responds to the executor with an 
`ExecutorRegisteredMessage` in its `registerExecutor()` method. Whenever the 
executor receives the `ExecutorRegisteredMessage`, it prints `Executor 
registered on agent...` to its stderr log.

h3. Problem description.
The agent launches the built-in docker executor, which is stuck in the `STAGING` state.
The stderr log of the docker executor:
{code}
I1114 23:03:17.919090 14322 exec.cpp:162] Version: 1.2.3
{code}
It doesn't contain a message like `Executor registered on agent...`. At the 
same time, the agent received the `RegisterExecutorMessage` and sent a 
`runTask` message to the executor.

The stdout log consists of the same repeating message:
{code}
Received killTask for task ...
{code}
Also, the docker executor process has no child processes.

Currently, the executor [doesn't 
attempt|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L320]
 to launch a task if it is not registered with the agent, while [task 
killing|https://github.com/apache/mesos/blob/2a253093ecdc7d743c9c0874d6e01b68f6a813e4/src/exec/exec.cpp#L343]
 doesn't have such a check.

It looks like `ExecutorRegisteredMessage` has been lost.
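
A minimal model of that asymmetry (illustrative C++, not the real `exec.cpp` 
code): launches are dropped while the executor is unregistered, but kills are 
not, so a lost registration message leaves the executor endlessly processing 
kill requests for a task it never launched. The symmetric guard is shown for 
comparison.
{code}
#include <iostream>
#include <string>

// Toy executor state machine; names are illustrative, not Mesos APIs.
class ToyExecutor
{
public:
  void onRegistered() { registered = true; }

  void onLaunchTask(const std::string& taskId)
  {
    if (!registered) {
      std::cout << "Ignoring launchTask for " << taskId
                << " because the executor is not registered\n";
      return;  // This guard exists in the real executor.
    }
    std::cout << "Launching " << taskId << '\n';
  }

  void onKillTask(const std::string& taskId)
  {
    // A symmetric guard (missing in the scenario described above):
    if (!registered) {
      std::cout << "Ignoring killTask for " << taskId
                << " because the executor is not registered\n";
      return;
    }
    std::cout << "Killing " << taskId << '\n';
  }

private:
  bool registered = false;
};

int main()
{
  ToyExecutor executor;

  // The ExecutorRegisteredMessage was lost, so onRegistered() never runs.
  executor.onLaunchTask("task-1");  // Dropped: never launched.
  executor.onKillTask("task-1");    // Without the guard this would "kill"
                                    // a task that was never launched.
  return 0;
}
{code}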



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256897#comment-16256897
 ] 

Andrei Budnik commented on MESOS-7506:
--

https://reviews.apache.org/r/63887/
https://reviews.apache.org/r/63888/

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-15 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7506:
-
Attachment: ROOT_IsolatorFlags-badrun.txt

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ROOT_IsolatorFlags-badrun.txt, ResourceLimitation-badrun.txt, 
> ResourceLimitation-badrun2.txt, 
> RestartSlaveRequireExecutorAuthentication-badrun.txt, 
> TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveTest.RestartSlaveRequireExecutorAuthentication
> LinuxCapabilitiesIsolatorFlagsTest.ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8137) Mesos agent can hang during startup.

2017-11-15 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253660#comment-16253660
 ] 

Andrei Budnik commented on MESOS-8137:
--

Probably related issues in glibc:
https://bugzilla.redhat.com/show_bug.cgi?id=1332917
https://bugzilla.redhat.com/show_bug.cgi?id=906468

> Mesos agent can hang during startup.
> 
>
> Key: MESOS-8137
> URL: https://issues.apache.org/jira/browse/MESOS-8137
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Jie Yu
>Priority: Critical
>
> Environment:
> Linux dcos-agentdisks-as1-1100-2 4.11.0-1011-azure #11-Ubuntu SMP Tue Sep 19 
> 19:03:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> {noformat}
> #0  __lll_lock_wait_private () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x7f132b856f7b in __malloc_fork_lock_parent () at arena.c:155
> #2  0x7f132b89f5da in __libc_fork () at ../sysdeps/nptl/fork.c:131
> #3  0x7f132b842350 in _IO_new_proc_open (fp=fp@entry=0xf1282b84e0, 
> command=command@entry=0xf1282b6ea8 “logrotate --help > /dev/null”, 
> mode=, mode@entry=0xf1275fb0f2 “r”)
> at iopopen.c:180
> #4  0x7f132b84265c in _IO_new_popen (command=0xf1282b6ea8 “logrotate 
> --help > /dev/null”, mode=0xf1275fb0f2 “r”) at iopopen.c:296
> #5  0x00f1275e622a in Try os::shell<>(std::string 
> const&) ()
> #6  0x7f130fdbae37 in 
> mesos::journald::flags::Flags()::{lambda(std::string 
> const&)#2}::operator()(std::string const&) const (value=..., 
> __closure=)
> at /pkg/src/mesos-modules/journald/lib_journald.hpp:153
> #7  void flags::FlagsBase::add [10], mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}::operator()(flags::FlagsBase 
> const) const (base=..., __closure=) at 
> /opt/mesosphere/active/mesos/include/stout/flags/flags.hpp:399
> #8  std::_Function_handler

[jira] [Commented] (MESOS-8137) Mesos agent can hang during startup.

2017-11-14 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251680#comment-16251680
 ] 

Andrei Budnik commented on MESOS-8137:
--

Do we have a stack trace of all the threads?
What is the version of glibc?

> Mesos agent can hang during startup.
> 
>
> Key: MESOS-8137
> URL: https://issues.apache.org/jira/browse/MESOS-8137
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Jie Yu
>Priority: Critical
>
> Environment:
> Linux dcos-agentdisks-as1-1100-2 4.11.0-1011-azure #11-Ubuntu SMP Tue Sep 19 
> 19:03:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> {noformat}
> #0  __lll_lock_wait_private () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
> #1  0x7f132b856f7b in __malloc_fork_lock_parent () at arena.c:155
> #2  0x7f132b89f5da in __libc_fork () at ../sysdeps/nptl/fork.c:131
> #3  0x7f132b842350 in _IO_new_proc_open (fp=fp@entry=0xf1282b84e0, 
> command=command@entry=0xf1282b6ea8 “logrotate --help > /dev/null”, 
> mode=, mode@entry=0xf1275fb0f2 “r”)
> at iopopen.c:180
> #4  0x7f132b84265c in _IO_new_popen (command=0xf1282b6ea8 “logrotate 
> --help > /dev/null”, mode=0xf1275fb0f2 “r”) at iopopen.c:296
> #5  0x00f1275e622a in Try os::shell<>(std::string 
> const&) ()
> #6  0x7f130fdbae37 in 
> mesos::journald::flags::Flags()::{lambda(std::string 
> const&)#2}::operator()(std::string const&) const (value=..., 
> __closure=)
> at /pkg/src/mesos-modules/journald/lib_journald.hpp:153
> #7  void flags::FlagsBase::add [10], mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2}>(std::string mesos::journald::flags::*, flags::Name const&, 
> Option const&, std::string const&, 
> char const (*) [10], 
> mesos::journald::flags::basic_string()::{lambda(std::string 
> const&)#2})::{lambda(flags::FlagsBase const&)#3}::operator()(flags::FlagsBase 
> const) const (base=..., __closure=) at 
> /opt/mesosphere/active/mesos/include/stout/flags/flags.hpp:399
> #8  std::_Function_handler

[jira] [Assigned] (MESOS-8172) Agent --authenticate_http_executors commandline flag unrecognized in 1.4.0

2017-11-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8172:


Assignee: Greg Mann

> Agent --authenticate_http_executors commandline flag unrecognized in 1.4.0
> --
>
> Key: MESOS-8172
> URL: https://issues.apache.org/jira/browse/MESOS-8172
> Project: Mesos
>  Issue Type: Bug
>  Components: executor, security
>Affects Versions: 1.4.0
> Environment: Ubuntu 16.04.3 with meso 1.4.0 compiled from source 
> tarball.
>Reporter: Dan Leary
>Assignee: Greg Mann
>
> Apparently the mesos-agent authenticate_http_executors commandline arg was 
> introduced in 1.3.0 by MESOS-6365.   But running "mesos-agent 
> --authenticate_http_executors ..." in 1.4.0 yields
> {noformat}
> Failed to load unknown flag 'authenticate_http_executors'
> {noformat}
> ...followed by a usage report that does not include 
> "--authenticate_http_executors".
> Presumably this means executor authentication is no longer configurable.
> It is still documented at 
> https://mesos.apache.org/documentation/latest/authentication/#agent



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Issue Comment Deleted] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky.

2017-11-08 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7082:
-
Comment: was deleted

(was: 
https://issues.apache.org/jira/browse/MESOS-7506?focusedCommentId=16243729)

> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is 
> flaky.
> -
>
> Key: MESOS-7082
> URL: https://issues.apache.org/jira/browse/MESOS-7082
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04 with/without SSL
> Fedora 23
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere
>
> Showed up on our internal CI
> {noformat}
> 07:00:17 [ RUN  ] 
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> 07:00:17 I0207 07:00:17.775459  2952 cluster.cpp:160] Creating default 
> 'local' authorizer
> 07:00:17 I0207 07:00:17.776511  2970 master.cpp:383] Master 
> fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started 
> on 10.153.254.29:38570
> 07:00:17 I0207 07:00:17.776538  2970 master.cpp:385] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/ZROfJk/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" 
> --zk_session_timeout="10secs"
> 07:00:17 I0207 07:00:17.776674  2970 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> 07:00:17 I0207 07:00:17.776687  2970 master.cpp:449] Master only allowing 
> authenticated agents to register
> 07:00:17 I0207 07:00:17.776695  2970 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> 07:00:17 I0207 07:00:17.776703  2970 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/ZROfJk/credentials'
> 07:00:17 I0207 07:00:17.776779  2970 master.cpp:507] Using default 'crammd5' 
> authenticator
> 07:00:17 I0207 07:00:17.776841  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> 07:00:17 I0207 07:00:17.776919  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> 07:00:17 I0207 07:00:17.776970  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> 07:00:17 I0207 07:00:17.777009  2970 master.cpp:587] Authorization enabled
> 07:00:17 I0207 07:00:17.777122  2975 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> 07:00:17 I0207 07:00:17.777138  2974 whitelist_watcher.cpp:77] No whitelist 
> given
> 07:00:17 I0207 07:00:17.04  2976 master.cpp:2123] Elected as the leading 
> master!
> 07:00:17 I0207 07:00:17.26  2976 master.cpp:1645] Recovering from 
> registrar
> 07:00:17 I0207 07:00:17.84  2975 registrar.cpp:329] Recovering registrar
> 07:00:17 I0207 07:00:17.777989  2973 registrar.cpp:362] Successfully fetched 
> the registry (0B) in 176384ns
> 07:00:17 I0207 07:00:17.778023  2973 registrar.cpp:461] Applied 1 operations 
> in 7573ns; attempting to update the registry
> 07:00:17 I0207 07:00:17.778249  2976 registrar.cpp:506] Successfully updated 
> the registry in 210944ns
> 07:00:17 I0207 07:00:17.778290  2976 registrar.cpp:392] Successfully 
> recovered registrar
> 07:00:17 I0207 07:00:17.778373  2976 master.cpp:1761] Recovered 0 agents from 
> the registry (172B); allowing 10mins for agents to re-register
> 07:00:17 I0207 07:00:17.778394  2974 hierarchical.cpp:188] Skipping recovery 
> of hierarchical allocator: nothing to recover
> 07:00:17 I0207 07:00:17.869381  2952 containerizer.cpp:220] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> 07:00:17 I0207 

[jira] [Commented] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky.

2017-11-08 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243738#comment-16243738
 ] 

Andrei Budnik commented on MESOS-7082:
--

https://issues.apache.org/jira/browse/MESOS-7506?focusedCommentId=16243729

> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is 
> flaky.
> -
>
> Key: MESOS-7082
> URL: https://issues.apache.org/jira/browse/MESOS-7082
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: ubuntu 16.04 with/without SSL
> Fedora 23
>Reporter: Anand Mazumdar
>Priority: Critical
>  Labels: flaky, flaky-test, mesosphere
>
> Showed up on our internal CI
> {noformat}
> 07:00:17 [ RUN  ] 
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> 07:00:17 I0207 07:00:17.775459  2952 cluster.cpp:160] Creating default 
> 'local' authorizer
> 07:00:17 I0207 07:00:17.776511  2970 master.cpp:383] Master 
> fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started 
> on 10.153.254.29:38570
> 07:00:17 I0207 07:00:17.776538  2970 master.cpp:385] Flags at startup: 
> --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/ZROfJk/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --max_unreachable_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" 
> --zk_session_timeout="10secs"
> 07:00:17 I0207 07:00:17.776674  2970 master.cpp:435] Master only allowing 
> authenticated frameworks to register
> 07:00:17 I0207 07:00:17.776687  2970 master.cpp:449] Master only allowing 
> authenticated agents to register
> 07:00:17 I0207 07:00:17.776695  2970 master.cpp:462] Master only allowing 
> authenticated HTTP frameworks to register
> 07:00:17 I0207 07:00:17.776703  2970 credentials.hpp:37] Loading credentials 
> for authentication from '/tmp/ZROfJk/credentials'
> 07:00:17 I0207 07:00:17.776779  2970 master.cpp:507] Using default 'crammd5' 
> authenticator
> 07:00:17 I0207 07:00:17.776841  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> 07:00:17 I0207 07:00:17.776919  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> 07:00:17 I0207 07:00:17.776970  2970 http.cpp:919] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> 07:00:17 I0207 07:00:17.777009  2970 master.cpp:587] Authorization enabled
> 07:00:17 I0207 07:00:17.777122  2975 hierarchical.cpp:161] Initialized 
> hierarchical allocator process
> 07:00:17 I0207 07:00:17.777138  2974 whitelist_watcher.cpp:77] No whitelist 
> given
> 07:00:17 I0207 07:00:17.04  2976 master.cpp:2123] Elected as the leading 
> master!
> 07:00:17 I0207 07:00:17.26  2976 master.cpp:1645] Recovering from 
> registrar
> 07:00:17 I0207 07:00:17.84  2975 registrar.cpp:329] Recovering registrar
> 07:00:17 I0207 07:00:17.777989  2973 registrar.cpp:362] Successfully fetched 
> the registry (0B) in 176384ns
> 07:00:17 I0207 07:00:17.778023  2973 registrar.cpp:461] Applied 1 operations 
> in 7573ns; attempting to update the registry
> 07:00:17 I0207 07:00:17.778249  2976 registrar.cpp:506] Successfully updated 
> the registry in 210944ns
> 07:00:17 I0207 07:00:17.778290  2976 registrar.cpp:392] Successfully 
> recovered registrar
> 07:00:17 I0207 07:00:17.778373  2976 master.cpp:1761] Recovered 0 agents from 
> the registry (172B); allowing 10mins for agents to re-register
> 07:00:17 I0207 07:00:17.778394  2974 hierarchical.cpp:188] Skipping recovery 
> of hierarchical allocator: nothing to recover
> 07:00:17 I0207 07:00:17.869381  2952 containerizer.cpp:220] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix,network/cni
> 07:00:17 I0207 

[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-08 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243729#comment-16243729
 ] 

Andrei Budnik commented on MESOS-7506:
--

*Second cause*

{{[ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/default_executor_tests.cpp#L1912]}}
 launches a task group, so each task is launched using the 
{{ComposingContainerizer}}.
When this test completes (after receiving a TASK_FINISHED status update), the 
Slave d-tor is called, where [it 
waits|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]
 for each container's [termination 
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/mesos/containerizer.cpp#L2528]
 to be triggered.
As this test uses the {{ComposingContainerizer}}, [calling 
destroy|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L572]
 for a container means the {{ComposingContainerizer}} subscribes to the same 
[container termination 
future|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/slave/containerizer/composing.cpp#L638-L647]
 via the {{onAny}} method. Once this future is triggered, a lambda function is 
dispatched; this lambda removes the {{containerId}} from the hash set.

When a container's termination future [is 
set|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1524],
 {{[AWAIT(wait)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L574]}}
 might [be 
satisfied|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L83],
 and hence the container hash set will be [requested 
(dispatched)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/src/tests/cluster.cpp#L577].
 So there is a race between the thread which sets the container's termination 
future and calls {{onReadyCallbacks}} and {{onAnyCallbacks}} (where calling 
{{onAnyCallbacks}} dispatches the aforementioned cleanup lambda), and the test 
thread which waits for the container's termination future and then calls 
{{containerizer->containers()}}.

To reproduce this case, we need to add a ~10ms sleep before 
[internal::run(copy->onAnyCallbacks, 
*this)|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/future.hpp#L1537]
 and remove the sleep from [process::internal::await 
|https://github.com/apache/mesos/blob/e5f1115739758163d95110960dd829e65cbbebff/3rdparty/libprocess/include/process/gtest.hpp#L92].
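
A self-contained toy model of this interleaving (plain {{std::thread}} code, 
not libprocess): the waiter is released as soon as the "termination future" 
becomes ready, but the cleanup callback that erases the container id runs a 
moment later on another thread, so the waiter can still observe a non-empty 
container set.
{code}
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

// Toy model of the race: the "future" is a boolean guarded by a mutex, the
// onAny-style cleanup is an erase that runs after the waiters are notified.
std::mutex m;
std::condition_variable cv;
bool terminated = false;                          // the "future" value
std::set<std::string> containers = {"da3e8aa8"};  // containerizer state

int main()
{
  std::thread completer([] {
    {
      std::lock_guard<std::mutex> lock(m);
      terminated = true;          // 1. mark the future ready
    }
    cv.notify_all();              // 2. wake up waiters (AWAIT succeeds)

    // 3. ...and only then run the onAny-style cleanup, slightly later.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> lock(m);
    containers.erase("da3e8aa8");
  });

  // The "test thread": wait for termination, then inspect containers().
  {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return terminated; });
    if (!containers.empty()) {
      // This is the test failure mode: "Failed to destroy containers: {...}"
      std::cout << "containers() still reports " << containers.size()
                << " container(s) after termination\n";
    }
  }

  completer.join();
  return 0;
}
{code}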

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-08 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225515#comment-16225515
 ] 

Andrei Budnik edited comment on MESOS-7506 at 11/8/17 11:01 AM:


*First cause*

Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) have a pattern [like 
this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406],
 where the clock is advanced by {{executor_registration_timeout}} and the test 
then waits in a loop until a task status update is sent. This loop executes 
while the container is being destroyed. At the same time, container destruction 
consists of multiple steps, one of which waits for [cgroups 
destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567].
 That means we have a race between the container destruction process and the 
loop that advances the clock, leading to one of the following outcomes:
# The container is completely destroyed before the advancing clock reaches a 
timeout (e.g. {{cgroups::DESTROY_TIMEOUT}}).
# A timeout is triggered by the clock advance before the container destruction 
completes. That results in [leaving 
orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380]
 containers, which are detected by the [Slave 
destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584]
 in `tests/cluster.cpp`, so the test fails.

The issue is easily reproduced by advancing the clock by 60 seconds or more in 
the loop that waits for a status update.
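
A toy, single-threaded model of this race (all numbers and names invented for 
illustration; the real tests use the libprocess test clock): if the loop 
advances virtual time past the destroy timeout before the asynchronous 
destruction has had enough wall-clock time to finish, the container is left 
orphaned.
{code}
#include <iostream>

// Virtual time (what Clock::advance manipulates in the tests) and wall-clock
// progress of the destroy path are tracked separately.
int main()
{
  const int destroyTimeoutSecs = 60;   // stand-in for cgroups::DESTROY_TIMEOUT
  const int destroyWorkIterations = 3; // wall-clock "work" left in the destroy

  int virtualTimeSecs = 0;
  int workRemaining = destroyWorkIterations;
  bool orphaned = false;

  // The test's status-update loop: each iteration advances the virtual
  // clock, while the destroy path makes one unit of wall-clock progress.
  while (workRemaining > 0) {
    virtualTimeSecs += 60;   // like Clock::advance(Seconds(60)) per iteration
    --workRemaining;         // destruction still running in the background

    if (virtualTimeSecs >= destroyTimeoutSecs && workRemaining > 0) {
      orphaned = true;       // the timeout fired before destruction finished
      break;
    }
  }

  std::cout << (orphaned
                  ? "destroy timed out: orphan container left behind\n"
                  : "destroy finished before the timeout\n");
  return 0;
}
{code}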


was (Author: abudnik):
Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) have a pattern [like 
this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406],
 where the clock is advanced by {{executor_registration_timeout}} and the test 
then waits in a loop until a task status update is sent. This loop executes 
while the container is being destroyed. At the same time, container destruction 
consists of multiple steps, one of which waits for [cgroups 
destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567].
 That means we have a race between the container destruction process and the 
loop that advances the clock, leading to one of the following outcomes:
# The container is completely destroyed before the advancing clock reaches a 
timeout (e.g. {{cgroups::DESTROY_TIMEOUT}}).
# A timeout is triggered by the clock advance before the container destruction 
completes. That results in [leaving 
orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380]
 containers, which are detected by the [Slave 
destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584]
 in `tests/cluster.cpp`, so the test fails.

The issue is easily reproduced by advancing the clock by 60 seconds or more in 
the loop that waits for a status update.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: KillMultipleTasks-badrun.txt, 
> ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-11-07 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241916#comment-16241916
 ] 

Andrei Budnik commented on MESOS-7506:
--

https://reviews.apache.org/r/63589/

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
> Attachments: ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-30 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225515#comment-16225515
 ] 

Andrei Budnik commented on MESOS-7506:
--

Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) have a pattern [like 
this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406],
 where the clock is advanced by {{executor_registration_timeout}} and then it 
waits in a loop until a task status update is sent. This loop is executing 
while the container is being destroyed. At the same time, container destruction 
consists of multiple steps, one of them waits for [cgroups 
destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567].
 That means, we have a race between container destruction process and the loop 
that advances the clock, leading to the following outcomes:
#  Container completely destroyed, before clock advancing reaches timeout (e.g. 
{{cgroups::DESTROY_TIMEOUT}}).
# Triggered timeout due to clock advancing, before container destruction 
completes. That results in [leaving 
orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380]
 containers that will be detected by [Slave 
destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584]
 in `tests/cluster.cpp`, so the test will fail.

The issue is easily reproduced by advancing the clocks by 60 seconds or more in 
the loop, which waits for a status update.

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212605#comment-16212605
 ] 

Andrei Budnik edited comment on MESOS-7506 at 10/20/17 6:30 PM:


Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 17:59:05.049862 16817 cgroups.cpp:1563] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.049876 16817 cgroups.cpp:3085] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.050351 16817 cgroups.cpp:1398] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.051440 16817 cgroups.cpp:1423] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7: FREEZING
I1020 17:59:05.051749 16819 cgroups.cpp:1398] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.052760 16819 cgroups.cpp:1416] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7 after 1secs
I1020 17:59:05.052858 16819 hierarchical.cpp:1488] Performed allocation for 1 
agents in 15715ns
I1020 17:59:05.052901 16819 cgroups.cpp:1574] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.053357 16819 cgroups.cpp:1584] TasksKiller::kill: reap: 31229
I1020 17:59:05.053377 16819 cgroups.cpp:1584] TasksKiller::kill: reap: 31243
I1020 17:59:05.054193 16819 cgroups.cpp:928] cgroups::kill: 31229
I1020 17:59:05.054206 16819 cgroups.cpp:928] cgroups::kill: 31243
I1020 17:59:05.054262 16819 cgroups.cpp:1598] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.054272 16819 cgroups.cpp:3103] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.054757 16819 cgroups.cpp:1432] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
I1020 17:59:05.057647 16819 cgroups.cpp:1449] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7 after 0ns
I1020 17:59:05.057842 16816 cgroups.cpp:1604] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/57152291-a86c-446d-8bd3-eb5c60dfefd7
{code}
{{TasksKiller::finished}} wasn't called, while {{TasksKiller::reap}} was called.


was (Author: abudnik):
Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964  9272 cgroups.cpp:1448] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225  9277 cgroups.cpp:1602] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613  9279 hierarchical.cpp:1488] Performed allocation for 1 
agents in 17680ns
I1020 12:07:20.22  9279 containerizer.cpp:2671] Container 

[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212605#comment-16212605
 ] 

Andrei Budnik edited comment on MESOS-7506 at 10/20/17 6:28 PM:


Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964  9272 cgroups.cpp:1448] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225  9277 cgroups.cpp:1602] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613  9279 hierarchical.cpp:1488] Performed allocation for 1 
agents in 17680ns
I1020 12:07:20.22  9279 containerizer.cpp:2671] Container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 has exited
{code}
{{TasksKiller::finished}} wasn't called, while {{TasksKiller::reap}} was called.


was (Author: abudnik):
Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 

[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-20 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212605#comment-16212605
 ] 

Andrei Budnik commented on MESOS-7506:
--

Bug has been reproduced with extra debug logs 
(SlaveTest.ShutdownUnregisteredExecutor):
{code}
I1020 12:07:20.266032  9274 containerizer.cpp:2220] Destroying container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 in RUNNING state
I1020 12:07:20.266042  9274 containerizer.cpp:2784] Transitioning the state of 
container 7f9cb5a6-26c9-4010-ace9-b9cb3e065542 from RUNNING to DESTROYING
I1020 12:07:20.266175  9274 linux_launcher.cpp:514] Asked to destroy container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.266717  9274 linux_launcher.cpp:560] Using freezer to destroy 
cgroup mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268649  9274 cgroups.cpp:1562] TasksKiller::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.268756  9274 cgroups.cpp:3083] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.269533  9276 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.270486  9276 cgroups.cpp:1422] Freezer::freeze 2: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542: FREEZING
I1020 12:07:20.270725  9272 cgroups.cpp:1397] Freezer::freeze: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.271625  9272 cgroups.cpp:1415] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 1secs
I1020 12:07:20.271724  9272 hierarchical.cpp:1488] Performed allocation for 1 
agents in 18541ns
I1020 12:07:20.271767  9272 cgroups.cpp:1573] TasksKiller::kill: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273386  9272 cgroups.cpp:1596] TasksKiller::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.273486  9272 cgroups.cpp:3101] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.274129  9272 cgroups.cpp:1431] Freezer::thaw: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.276964  9272 cgroups.cpp:1448] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542 after 0ns
I1020 12:07:20.277225  9277 cgroups.cpp:1602] TasksKiller::reap: 
/sys/fs/cgroup/freezer/mesos/7f9cb5a6-26c9-4010-ace9-b9cb3e065542
I1020 12:07:20.277613  9279 hierarchical.cpp:1488] Performed allocation for 1 
agents in 17680ns
I1020 12:07:20.22  9279 containerizer.cpp:2671] Container 
7f9cb5a6-26c9-4010-ace9-b9cb3e065542 has exited
{code}
{{TasksKiller::finished}} was not called, even though {{TasksKiller::reap}} was. 
So I assume there is a race condition in {{TasksKiller::kill}}: the list of pids 
L1 that {{cgroups::processes()}} returns in {{TasksKiller::kill}} probably 
differs from the list L2 returned by the same function inside 
{{cgroups::kill}}, so some processes end up killed without being reaped (or 
reaped without being killed).
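
A rough sketch of the suspected race (a hypothetical simplification of the 
{{TasksKiller}} code path, not the exact implementation):
{code}
  // The pid set is read once and only those pids are reaped.
  Try<std::set<pid_t>> pids = cgroups::processes(hierarchy, cgroup);  // list L1

  std::list<process::Future<Option<int>>> statuses;
  foreach (pid_t pid, pids.get()) {
    statuses.push_back(process::reap(pid));  // only pids from L1 are reaped
  }

  // cgroups::kill() re-reads the cgroup and may observe a different list L2.
  // A pid present in one list but not the other is either killed without
  // being reaped or reaped without being killed, which can leave the chain
  // of futures leading to TasksKiller::finished hanging.
  Try<Nothing> kill = cgroups::kill(hierarchy, cgroup, SIGKILL);
{code}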

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-18 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16209541#comment-16209541
 ] 

Andrei Budnik commented on MESOS-7506:
--

All failing tests have the same error message in logs like:
{{E0922 00:38:40.509032 31034 slave.cpp:5398] Termination of executor '1' of 
framework 83bd1613-70d9-4c3e-b490-4aa60dd26e22- failed: Failed to kill all 
processes in the container: Timed out after 1mins}}

The container termination future is satisfied by 
[MesosContainerizerProcess::___destroy|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/containerizer/mesos/containerizer.cpp#L2361].
 The agent subscribes to this future by calling 
[containerizer->wait()|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/slave/slave.cpp#L5280].
 When the future is satisfied, {{Slave::executorTerminated}} is called, which 
sends a {{TASK_FAILED}} status update.

A typical test (e.g. {{SlaveTest.ShutdownUnregisteredExecutor}}) waits for
{code}
  // Ensure that the slave times out and kills the executor.
  Future<Nothing> destroyExecutor =
    FUTURE_DISPATCH(_, &MesosContainerizerProcess::destroy);
{code}

After that, the test waits for the {{TASK_FAILED}} status update. The test then 
completes successfully and the slave's destructor is called, [which 
fails|https://github.com/apache/mesos/blob/b361801f2c78043459199dab3e0defe9a0b4c1aa/src/tests/cluster.cpp#L580],
 because {{MesosContainerizerProcess::___destroy}} has not yet erased the 
container from the hashmap.
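
For reference, a rough sketch of the teardown check in `tests/cluster.cpp` that 
produces the "Failed to destroy containers" failure (simplified, not the exact 
code):
{code}
  // Ask the containerizer which containers it still knows about and require
  // that the set is empty; any leftover container fails the test at teardown.
  Future<hashset<ContainerID>> containers = containerizer->containers();
  AWAIT_READY(containers);

  EXPECT_TRUE(containers->empty())
    << "Failed to destroy containers: " << stringify(containers.get());
{code}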

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207751#comment-16207751
 ] 

Andrei Budnik commented on MESOS-7504:
--

List of failing tests:
{{NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover}}
{{ROOT_CGROUPS_DebugNestedContainerInheritsEnvironment}}

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.522195 17191 containerizer.cpp:1524] Launching 
> 

[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-17 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207669#comment-16207669
 ] 

Andrei Budnik commented on MESOS-7504:
--

https://reviews.apache.org/r/63074/
https://reviews.apache.org/r/63035/

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.522195 17191 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> 

[jira] [Commented] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky

2017-10-11 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200383#comment-16200383
 ] 

Andrei Budnik commented on MESOS-8005:
--

{code}
[ RUN  ] SlaveTest.ShutdownUnregisteredExecutor
I0922 00:38:40.364121 31018 cluster.cpp:162] Creating default 'local' authorizer
I0922 00:38:40.365996 31034 master.cpp:445] Master 
83bd1613-70d9-4c3e-b490-4aa60dd26e22 (ip-172-16-10-25) started on 
172.16.10.25:44747
I0922 00:38:40.366019 31034 master.cpp:447] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/u6YBLG/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/u6YBLG/master" 
--zk_session_timeout="10secs"
I0922 00:38:40.366137 31034 master.cpp:497] Master only allowing authenticated 
frameworks to register
I0922 00:38:40.366145 31034 master.cpp:511] Master only allowing authenticated 
agents to register
I0922 00:38:40.366150 31034 master.cpp:524] Master only allowing authenticated 
HTTP frameworks to register
I0922 00:38:40.366155 31034 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/u6YBLG/credentials'
I0922 00:38:40.366237 31034 master.cpp:569] Using default 'crammd5' 
authenticator
I0922 00:38:40.366286 31034 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0922 00:38:40.366349 31034 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0922 00:38:40.366389 31034 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0922 00:38:40.366443 31034 master.cpp:649] Authorization enabled
I0922 00:38:40.366475 31039 hierarchical.cpp:171] Initialized hierarchical 
allocator process
I0922 00:38:40.366564 31038 whitelist_watcher.cpp:77] No whitelist given
I0922 00:38:40.367216 31036 master.cpp:2166] Elected as the leading master!
I0922 00:38:40.367238 31036 master.cpp:1705] Recovering from registrar
I0922 00:38:40.367282 31036 registrar.cpp:347] Recovering registrar
I0922 00:38:40.367449 31036 registrar.cpp:391] Successfully fetched the 
registry (0B) in 150016ns
I0922 00:38:40.367483 31036 registrar.cpp:495] Applied 1 operations in 5392ns; 
attempting to update the registry
I0922 00:38:40.367624 31034 registrar.cpp:552] Successfully updated the 
registry in 119808ns
I0922 00:38:40.367697 31034 registrar.cpp:424] Successfully recovered registrar
I0922 00:38:40.367858 31036 hierarchical.cpp:209] Skipping recovery of 
hierarchical allocator: nothing to recover
I0922 00:38:40.367869 31037 master.cpp:1804] Recovered 0 agents from the 
registry (142B); allowing 10mins for agents to re-register
I0922 00:38:40.368898 31018 containerizer.cpp:292] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0922 00:38:40.372519 31018 linux_launcher.cpp:146] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0922 00:38:40.372859 31018 provisioner.cpp:255] Using default backend 'overlay'
W0922 00:38:40.375388 31018 process.cpp:3194] Attempted to spawn already 
running process files@172.16.10.25:44747
I0922 00:38:40.375486 31018 cluster.cpp:448] Creating default 'local' authorizer
I0922 00:38:40.375942 31036 slave.cpp:254] Mesos agent started on 
(531)@172.16.10.25:44747
W0922 00:38:40.376080 31018 process.cpp:3194] Attempted to spawn already 
running process version@172.16.10.25:44747
I0922 00:38:40.375958 31036 slave.cpp:255] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/SlaveTest_ShutdownUnregisteredExecutor_mhaf10/store/appc"
 --authenticate_http_executors="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 

[jira] [Commented] (MESOS-7506) Multiple tests leave orphan containers.

2017-10-11 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200341#comment-16200341
 ] 

Andrei Budnik commented on MESOS-7506:
--

I put a {{::sleep(2);}} after {{slave = this->StartSlave(detector.get(), 
containerizer.get(), flags);}} in 
[SlaveRecoveryTest.RecoverTerminatedExecutor|https://github.com/apache/mesos/blob/0908303142f641c1697547eb7f8e82a205d6c362/src/tests/slave_recovery_tests.cpp#L1634]
 and got:

{code}
../../src/tests/slave_recovery_tests.cpp:1656: Failure
  Expected: TASK_LOST
To be equal to: status->state()
  Which is: TASK_FAILED
{code}

> Multiple tests leave orphan containers.
> ---
>
> Key: MESOS-7506
> URL: https://issues.apache.org/jira/browse/MESOS-7506
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed a number of flaky tests that leave orphan containers upon 
> cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-05 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193100#comment-16193100
 ] 

Andrei Budnik commented on MESOS-7504:
--

The containerizer launcher spawns 
[pre-exec hooks|https://github.com/apache/mesos/blob/46db7e4f27831d20244a57b22a70312f2a574395/src/slave/containerizer/mesos/launch.cpp#L384]
 before launching the given command (e.g. `sleep 1000`).
The 
{{NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover}} 
test uses the isolation {{"cgroups/cpu,filesystem/linux,namespaces/pid"}}, where 
the `filesystem/linux` and `namespaces/pid` isolators add two pre-exec hooks, 
as seen in the logs:
{code}
Executing pre-exec command 
'{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/abudnik\/mesos\/build\/src\/mesos-containerizer"}'
Executing pre-exec command '{"shell":true,"value":"mount -n -t proc proc \/proc 
-o nosuid,noexec,nodev"}'
{code}
After launching the parent container, we try to launch a nested container. The 
agent 
[calls|https://github.com/apache/mesos/blob/46db7e4f27831d20244a57b22a70312f2a574395/src/slave/containerizer/mesos/containerizer.cpp#L1758]
 the 
[getMountNamespaceTarget|https://github.com/apache/mesos/blob/46db7e4f27831d20244a57b22a70312f2a574395/src/slave/containerizer/mesos/utils.cpp#L59]
 function, which in this test returns the "Cannot get target mount namespace 
from process" error.
If you look at that function, there is a small window between enumerating all 
child processes (a list that may still include running pre-exec hook processes) 
and calling {{ns::getns}} for each of them. During this window any of the 
pre-exec hook processes may exit, which causes this error message.
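
A simplified sketch of that window, using stout/Mesos helpers (not the exact 
implementation of {{getMountNamespaceTarget}}):
{code}
  // Determine the parent's mount namespace.
  Try<ino_t> parentNamespace = ns::getns(parent, "mnt");

  // Enumerate the children of the parent container's init process. The list
  // may still include short-lived pre-exec hook processes.
  Try<std::set<pid_t>> children = os::children(parent, false);

  foreach (pid_t child, children.get()) {
    // If `child` is a pre-exec hook that has already exited, stat'ing
    // /proc/<child>/ns/mnt fails here and the nested container launch fails
    // with "Cannot get target mount namespace from process <parent>".
    Try<ino_t> childNamespace = ns::getns(child, "mnt");
    if (childNamespace.isError()) {
      return Error("Cannot get 'mnt' namespace for child process '" +
                   stringify(child) + "': " + childNamespace.error());
    }

    if (childNamespace.get() != parentNamespace.get()) {
      // The command child has already entered its own mount namespace.
      return child;
    }
  }
{code}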

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] 

[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-05 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192797#comment-16192797
 ] 

Andrei Budnik commented on MESOS-7504:
--

Code modifications to reproduce test failure:
1. Add {{::sleep(1);}} to 
https://github.com/apache/mesos/blob/657a930e173aaee7a168734bf59e8eb022d6668f/src/tests/containerizer/nested_mesos_containerizer_tests.cpp#L1144
2. Add {{launchInfo.add_pre_exec_commands()->set_value("sleep 2");}} to 
https://github.com/apache/mesos/blob/657a930e173aaee7a168734bf59e8eb022d6668f/src/slave/containerizer/mesos/isolators/namespaces/pid.cpp#L135
3. Add {{::sleep(3);}} to 
https://github.com/apache/mesos/blob/657a930e173aaee7a168734bf59e8eb022d6668f/src/slave/containerizer/mesos/utils.cpp#L73

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 

[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.

2017-10-04 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191456#comment-16191456
 ] 

Andrei Budnik commented on MESOS-7504:
--

{{(launch).failure(): Cannot get target mount namespace from process 10991: 
Cannot get 'mnt' namespace for 2nd-level child process '11001': Failed to stat 
mnt namespace handle for pid 11001: No such file or directory}}

> Parent's mount namespace cannot be determined when launching a nested 
> container.
> 
>
> Key: MESOS-7504
> URL: https://issues.apache.org/jira/browse/MESOS-7504
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.3.0
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: containerizer, flaky-test, mesosphere
>
> I've observed this failure twice in different Linux environments. Here is an 
> example of such failure:
> {noformat}
> [ RUN  ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover
> I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: 
> cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image
> I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend 
> 'overlay'
> I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer
> I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete
> I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework 
> I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at 
> '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d'
>  for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus 
> 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d
> I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount
>  -n -t proc proc \/proc -o 
> nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}"
>  --pipe_read="29" --pipe_write="32" 
> --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d"
>  --unshare_namespace_mnt="false"'
> I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWPID
> I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's 
> forked pid 1873 to 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid'
> I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for 
> container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr
> I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b"
>  --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" 
> --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" 
> --wait_for_connection="true"' for container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard 
> server (pid: 1881) listening on socket file 
> '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for 
> container 
> 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35
> 

[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2017-10-04 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191117#comment-16191117
 ] 

Andrei Budnik commented on MESOS-4812:
--

I have closed [/r/62381|https://reviews.apache.org/r/62381/]; for details, see 
the comment in the discard reason.

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: Andrei Budnik
>  Labels: health-check, mesosphere, tech-debt
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8037) ns::clone should spawn process, which is a direct child

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184801#comment-16184801
 ] 

Andrei Budnik commented on MESOS-8037:
--

Health checks use their own procedure to enter namespaces, see 
https://github.com/apache/mesos/blob/7b79d8d4fb47aca05d28033f34a1f6b75dcfbe87/src/checks/checker_process.cpp#L103-L139

Health checks cannot enter the PID namespace. In addition, the user (client 
code) of health checks has to pass the list of namespaces in a specific order, 
because the order in which we enter namespaces matters. To solve both problems 
we could use {{ns::clone}}, but it returns the pid of a process that is not our 
direct child, so we cannot get its exit code, which health checks need.

This feature could also be useful in the Mesos containerizer, e.g. for logging 
the status of an exited process: 
https://github.com/apache/mesos/blob/7b79d8d4fb47aca05d28033f34a1f6b75dcfbe87/src/slave/containerizer/mesos/linux_launcher.cpp#L480
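
Roughly, the "enter namespaces in order" step looks like the following sketch 
({{taskPid}} and {{namespaces}} are assumed names, not the exact code in 
checker_process.cpp):
{code}
  // Enter the target process' namespaces one by one, in the order supplied
  // by the caller. The order matters: entering some namespaces too early
  // (e.g. 'mnt') can break the lookups needed to enter the remaining ones.
  foreach (const std::string& ns, namespaces) {
    Try<Nothing> setns = ns::setns(taskPid, ns);
    if (setns.isError()) {
      ::_exit(EXIT_FAILURE);
    }
  }
{code}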
  

> ns::clone should spawn process, which is a direct child
> ---
>
> Key: MESOS-8037
> URL: https://issues.apache.org/jira/browse/MESOS-8037
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>
> `ns::clone` does double-fork in order to be able to enter given PID namespace 
> and returns grandchild's pid, which is not a direct child of a parent 
> process, hence parent process can not retrieve status of an exited grandchild 
> process.
> As second fork is implemented via `os::clone`, we can pass `CLONE_PARENT` 
> flag. Also, we have to handle both intermediate child process and grandchild 
> process to avoid zombies.
> Motivation behind this improvement is that both `docker exec` and `LXC 
> attach` can enter process' PID namespace, while still controlling child's 
> status code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8037) ns::clone should spawn process, which is a direct child

2017-09-28 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-8037:


 Summary: ns::clone should spawn process, which is a direct child
 Key: MESOS-8037
 URL: https://issues.apache.org/jira/browse/MESOS-8037
 Project: Mesos
  Issue Type: Improvement
Reporter: Andrei Budnik


`ns::clone` does a double fork in order to be able to enter the given PID namespace 
and returns the grandchild's pid. The grandchild is not a direct child of the parent 
process, hence the parent process cannot retrieve the exit status of the grandchild 
once it terminates.
Since the second fork is implemented via `os::clone`, we can pass the `CLONE_PARENT` 
flag. We also have to reap both the intermediate child process and the grandchild 
process to avoid zombies; a sketch is included below.

The motivation behind this improvement is that both `docker exec` and `LXC attach` 
can enter a process' PID namespace while still controlling the child's exit status.
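
A minimal, hypothetical sketch of the proposed approach (not the actual `ns::clone` 
code): the intermediate child stands in for the process that would `setns()` into 
the target namespaces, the payload command is a placeholder, and the final `wait()` 
replaces passing the grandchild's pid back over a pipe.

{code:cpp}
// Hypothetical sketch of a double fork where the grandchild stays a direct
// child of the original process thanks to CLONE_PARENT. This is NOT the
// actual ns::clone implementation; namespace entering is omitted.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // For the glibc clone(2) wrapper.
#endif

#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

#include <csignal>
#include <cstdio>
#include <vector>

static int grandchildMain(void*)
{
  // Grandchild: run the payload ("true" is just a placeholder command).
  execlp("true", "true", (char*) nullptr);
  return 127; // Only reached if exec fails.
}

int main()
{
  pid_t child = fork();
  if (child == -1) {
    perror("fork");
    return 1;
  }

  if (child == 0) {
    // Intermediate child: the real code would setns() into the target
    // namespaces here, before cloning again.
    std::vector<char> stack(1024 * 1024);

    // CLONE_PARENT re-parents the new process to *our* parent, so the
    // grandchild becomes a direct child of the original process.
    pid_t grandchild = clone(
        grandchildMain,
        stack.data() + stack.size(), // Stack grows downwards.
        CLONE_PARENT | SIGCHLD,
        nullptr);

    // Exit right away; there is no grandchild to reap here because of
    // CLONE_PARENT.
    _exit(grandchild == -1 ? 1 : 0);
  }

  // Original process: reap the intermediate child to avoid a zombie ...
  waitpid(child, nullptr, 0);

  // ... then wait for the grandchild, which is now a direct child. (The real
  // code would learn the grandchild's pid via a pipe instead of wait().)
  int status = 0;
  pid_t grandchild = wait(&status);

  if (grandchild != -1 && WIFEXITED(status)) {
    printf("grandchild %d exited with status %d\n",
           (int) grandchild, WEXITSTATUS(status));
  }

  return 0;
}
{code}

Because of `CLONE_PARENT`, the grandchild is re-parented to the original process, 
which can therefore retrieve its exit status directly.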




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184109#comment-16184109
 ] 

Andrei Budnik commented on MESOS-7500:
--

Examples of related failing tests:
[ FAILED ] CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled
[ FAILED ] CommandExecutorCheckTest.CommandCheckStatusChange
[ FAILED ] DefaultExecutorCheckTest.CommandCheckDeliveredAndReconciled
[ FAILED ] DefaultExecutorCheckTest.CommandCheckStatusChange
[ FAILED ] DefaultExecutorCheckTest.CommandCheckSeesParentsEnv
[ FAILED ] DefaultExecutorCheckTest.CommandCheckSharesWorkDirWithTask

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16184072#comment-16184072
 ] 

Andrei Budnik commented on MESOS-7500:
--

Command health checks are executed via the `LAUNCH_NESTED_CONTAINER_SESSION` call 
and launched inside a DEBUG container.
A DEBUG container is always launched in a pair with a `mesos-io-switchboard` 
process. After spawning `mesos-io-switchboard`, the agent tries to connect to it 
via a unix domain socket. If the DEBUG container exits before `mesos-io-switchboard` 
does, the agent sends SIGTERM to the switchboard process after a 5-second delay. If 
the `mesos-io-switchboard` process exits after being killed by that signal, the 
`LAUNCH_NESTED_CONTAINER_SESSION` call is considered failed, as is the 
corresponding health check.
It turned out that `mesos-io-switchboard` is not an executable, but a special 
wrapper script generated by libtool. The first time this script is executed, 
relinking of the real executable is triggered. Relinking takes quite a while on 
slow machines (e.g. in Apache CI): I've seen 8 seconds and more. So when the DEBUG 
container exits, the agent sends SIGTERM (as described above) to a process which 
is still being relinked. This happens each time a health check is launched, and as 
a result we see a bunch of failed tests in Apache CI.
To fix this issue we need to force libtool/autotools to generate a binary instead 
of a wrapper script; see:
1. https://autotools.io/libtool/wrappers.html
2. `info libtool`

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179000#comment-16179000
 ] 

Andrei Budnik edited comment on MESOS-7500 at 9/25/17 2:18 PM:
---

The issue is caused by recompilation/relinking of the executable by the libtool 
wrapper script. E.g. when we launch `mesos-io-switchboard` for the first time, the 
executable might be missing, so the wrapper script starts to compile/link the 
corresponding executable. On slow machines compilation takes quite a while, 
hence these tests become flaky.

One possible solution is to pass [\-\-enable-fast-install=no 
(--disable-fast-install)|http://mdcc.cx/pub/autobook/autobook-latest/html/autobook_85.html]
 as the $CONFIGURATION environment variable into the docker helper script.


was (Author: abudnik):
The issue is caused by recompilation/relinking of an executable by libtool 
wrapper script. E.g. when we launch `mesos-io-switchboard` for the first time, 
executable might be missing, so wrapper script starts to compile/link 
corresponding executable. On slow machines compilation takes quite a while, 
hence these tests become flaky.

One possible solution is to pass 
[--disable-fast-install|http://mdcc.cx/pub/autobook/autobook-latest/html/autobook_85.html]
 as $CONFIGURATION environment variable into docker helper script.

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7500:
-
Story Points: 8  (was: 5)

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7500:


Assignee: Andrei Budnik  (was: Gastón Kleiman)

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-25 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179000#comment-16179000
 ] 

Andrei Budnik commented on MESOS-7500:
--

The issue is caused by recompilation/relinking of the executable by the libtool 
wrapper script. E.g. when we launch `mesos-io-switchboard` for the first time, the 
executable might be missing, so the wrapper script starts to compile/link the 
corresponding executable. On slow machines compilation takes quite a while, 
hence these tests become flaky.

One possible solution is to pass 
[--disable-fast-install|http://mdcc.cx/pub/autobook/autobook-latest/html/autobook_85.html]
 as the $CONFIGURATION environment variable into the docker helper script.

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7500) Command checks via agent lead to flaky tests.

2017-09-21 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174785#comment-16174785
 ] 

Andrei Budnik commented on MESOS-7500:
--

Another example from a failed run, including debug output 
(https://reviews.apache.org/r/59107):
https://pastebin.com/iKA1WaZB

> Command checks via agent lead to flaky tests.
> -
>
> Key: MESOS-7500
> URL: https://issues.apache.org/jira/browse/MESOS-7500
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Gastón Kleiman
>  Labels: check, flaky-test, health-check, mesosphere
>
> Tests that rely on command checks via agent are flaky on Apache CI. Here is 
> an example from one of the failed run: https://pastebin.com/g2mPgYzu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-4812) Mesos fails to escape command health checks

2017-09-18 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-4812:


Assignee: Andrei Budnik  (was: haosdent)

Reworked Haosdent's patch:
https://reviews.apache.org/r/62381/

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: Andrei Budnik
>  Labels: health-check, mesosphere, tech-debt
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside double 
> quotes of a sh -c "" doesn't escape the double quotes in the command.
> If I escape the double quotes myself the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> his commands which can't be right.
> I was told this is not a Marathon but a Mesos issue so am opening this JIRA. 
> I don't know if this only affects the command health check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7892) Filter results of `/state` on agent by role.

2017-09-05 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7892:
-
Fix Version/s: (was: 1.4.0)
   1.5.0

> Filter results of `/state` on agent by role.
> 
>
> Key: MESOS-7892
> URL: https://issues.apache.org/jira/browse/MESOS-7892
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>  Labels: mesosphere, security
> Fix For: 1.5.0
>
>
> The results returned by {{/state}} include data about resource reservations 
> per each role, which should be filtered for certain users, particularly in a 
> multi-tenancy scenario.
> The kind of leaked data includes specific role names and their specific 
> reservations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6428) Mesos containerizer helper function signalSafeWriteStatus is not AS-Safe

2017-08-28 Thread Andrei Budnik (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143878#comment-16143878
 ] 

Andrei Budnik commented on MESOS-6428:
--

https://reviews.apache.org/r/61800/

> Mesos containerizer helper function signalSafeWriteStatus is not AS-Safe
> 
>
> Key: MESOS-6428
> URL: https://issues.apache.org/jira/browse/MESOS-6428
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.1.0
>Reporter: Benjamin Bannier
>Assignee: Jing Chen
>  Labels: newbie, tech-debt
>
> In {{src/slave/containerizer/mesos/launch.cpp}} a helper function 
> {{signalSafeWriteStatus}} is defined. Its name seems to suggest that this 
> function is safe to call in e.g., signal handlers, and it is used in this 
> file's {{signalHandler}} for exactly that purpose.
> Currently this function is not AS-Safe since it e.g., allocates memory via 
> construction of {{string}} instances, and might destructively modify 
> {{errno}}.
> We should clean up this function to be in fact AS-Safe.
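
For reference, a minimal, hypothetical sketch of what an AS-Safe variant of 
{{signalSafeWriteStatus}} could look like (the signature, the {{statusFd}} setup 
and the handler are placeholders, not the actual {{launch.cpp}} code): only 
{{write(2)}} is called, nothing is heap-allocated, and {{errno}} is saved and 
restored.

{code:cpp}
// Hypothetical sketch of an AS-Safe status write: only write(2) is used,
// no heap allocation, and errno is preserved across the call.
#include <signal.h>
#include <unistd.h>

#include <cerrno>

static volatile int statusFd = -1; // Assumed to be set up before the handler.

static void signalSafeWriteStatus(int value)
{
  const int savedErrno = errno; // A handler must not clobber errno.

  // Format the integer into a stack buffer (ignoring INT_MIN for brevity).
  char buf[16];
  char* p = buf + sizeof(buf);
  unsigned int v = (value < 0) ? -value : value;

  do {
    *--p = (char) ('0' + v % 10);
    v /= 10;
  } while (v != 0);

  if (value < 0) {
    *--p = '-';
  }

  // write(2) is on POSIX's list of async-signal-safe functions. A real
  // implementation would also retry on EINTR and handle short writes.
  (void) write(statusFd, p, (buf + sizeof(buf)) - p);

  errno = savedErrno;
}

static void signalHandler(int sig)
{
  // The real handler writes a container status; here we just write the
  // signal number.
  signalSafeWriteStatus(sig);
}

int main()
{
  statusFd = STDOUT_FILENO;

  struct sigaction action = {};
  action.sa_handler = signalHandler;
  sigaction(SIGTERM, &action, nullptr);

  raise(SIGTERM); // Triggers the handler, which prints "15".
  (void) write(statusFd, "\n", 1);

  return 0;
}
{code}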



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7586) Make use of cout/cerr and glog consistent.

2017-08-28 Thread Andrei Budnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-7586:
-
Description: 
Some parts of mesos use glog before glog is initialized, hence messages logged via 
glog might not end up in the logdir:
bq. WARNING: Logging before InitGoogleLogging() is written to STDERR

The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, a main function has an initialization pattern like:
# load = flags.load(argc, argv) // Load flags from the command line.
# Check that the flags are correct, otherwise print an error message to cerr and 
then exit.
# Check whether the user passed the --help flag; if so, print a help message to 
cout and then exit.
# Parse and set up environment variables. If this fails, the EXIT macro is used 
to print an error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to avoid the extra information generated by 
glog, like the current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4; this is safe because 
{{logging::initialize()}} doesn't depend on {{process::initialize()}}.
In addition, initialization of glog should be added where it's necessary.

  was:
Some parts of mesos use glog before initialization of glog. This leads to 
message like:
bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
Also, messages via glog before logging is initialized might not end up in a 
logdir.
 
The solution might be:
{{cout/cerr}} should be used before logging initialization.
{{glog}} should be used after logging initialization.
 
Usually, main function has initialization pattern like:
# load = flags.load(argc, argv) // Load flags from command line.
# Check if flags are correct, otherwise print error message to cerr and then 
exit.
# Check if user passed --help flag to print help message to cout and then exit.
# Parsing and setup of environment variables. If this fails, EXIT macro is used 
to print error message via glog.
# process::initialize()
# logging::initialize()
 
Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
generated by glog like current time, date and log level.

It would be preferable to move step 6 between steps 3 and 4 safely, because 
{{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
In addition, initialization of glog should be added, where it's necessary.


> Make use of cout/cerr and glog consistent.
> --
>
> Key: MESOS-7586
> URL: https://issues.apache.org/jira/browse/MESOS-7586
> Project: Mesos
>  Issue Type: Bug
>Reporter: Andrei Budnik
>Assignee: Armand Grillet
>Priority: Minor
>  Labels: debugging, log, newbie
>
> Some parts of mesos use glog before initialization of glog, hence messages 
> via glog might not end up in a logdir:
> bq. WARNING: Logging before InitGoogleLogging() is written to STDERR
> The solution might be:
> {{cout/cerr}} should be used before logging initialization.
> {{glog}} should be used after logging initialization.
>  
> Usually, main function has initialization pattern like:
> # load = flags.load(argc, argv) // Load flags from command line.
> # Check if flags are correct, otherwise print error message to cerr and then 
> exit.
> # Check if user passed --help flag to print help message to cout and then 
> exit.
> # Parsing and setup of environment variables. If this fails, EXIT macro is 
> used to print error message via glog.
> # process::initialize()
> # logging::initialize()
>  
> Steps 2 and 3 should use {{cout/cerr}} to eliminate any extra information 
> generated by glog like current time, date and log level.
> It would be preferable to move step 6 between steps 3 and 4 safely, because 
> {{logging::initialize()}} doesn’t depend on {{process::initialize()}}.
> In addition, initialization of glog should be added, where it's necessary.
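
For illustration, a minimal, hypothetical sketch of the ordering described above, 
with raw glog calls standing in for Mesos' own flags/{{logging::initialize()}} 
helpers; the flag parsing and the MESOS_WORK_DIR check are placeholders.

{code:cpp}
// Hypothetical sketch of the initialization ordering: cout/cerr for the early
// exits (steps 2-3), logging initialized before anything that may log, glog
// used afterwards.
#include <glog/logging.h>

#include <cstdlib>
#include <cstring>
#include <iostream>

int main(int argc, char** argv)
{
  bool help = false;

  // 1. "Load flags" (trivial stand-in for flags.load(argc, argv)).
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "--help") == 0) {
      help = true;
    } else {
      // 2. Bad flag: plain cerr, since logging is not initialized yet.
      std::cerr << "Unknown flag: " << argv[i] << std::endl;
      return EXIT_FAILURE;
    }
  }

  // 3. --help: plain cout, since logging is not initialized yet.
  if (help) {
    std::cout << "Usage: " << argv[0] << " [--help]" << std::endl;
    return EXIT_SUCCESS;
  }

  // 6. (moved between steps 3 and 4) Initialize logging before anything that
  //    may log, so glog messages end up in the configured log dir.
  google::InitGoogleLogging(argv[0]);

  // 4. Environment setup: glog is now safe to use for errors.
  if (::getenv("MESOS_WORK_DIR") == nullptr) {
    LOG(ERROR) << "MESOS_WORK_DIR is not set";
    return EXIT_FAILURE;
  }

  // 5. process::initialize() would run here in the real code.

  LOG(INFO) << "Initialization complete";
  return EXIT_SUCCESS;
}
{code}

With this ordering, the early exits in steps 2 and 3 never touch glog, and 
everything from step 4 onwards can rely on it.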



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

