[jira] [Commented] (MESOS-3892) Add a helper function to the Agent to retrieve the list of executors that are using optimistically offered, revocable resources.
[ https://issues.apache.org/jira/browse/MESOS-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037416#comment-15037416 ] Guangya Liu commented on MESOS-3892: I know we have already had a lot of discussion about whether the master can kill tasks or not. Some frameworks may not implement {{terminateTask()}}, so once the executor is killed the task becomes TASK_FAILED and all of its resources are recovered even though the task keeps running; this leaves the host overcommitted. Kubernetes on Mesos is such a case, and I filed a ticket to track the k8s-on-Mesos issue: https://github.com/kubernetes/kubernetes/issues/18066 Having the master kill the task directly may not guarantee QoS, but it does keep resource accounting correct when an executor does not implement the {{terminateTask()}} API. Would it be possible to add a new field in Framework to control whether to kill the task or the executor when that framework's tasks are preempted? > Add a helper function to the Agent to retrieve the list of executors that are > using optimistically offered, revocable resources. > > > Key: MESOS-3892 > URL: https://issues.apache.org/jira/browse/MESOS-3892 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan >Assignee: Klaus Ma > Labels: mesosphere > > {noformat} > class Slave { > ... > // How the master currently keeps track of executors. > hashmap> executors; > ... > // Returns the list of executors that are using optimistically- > // offered, revocable resources. > list getEvictableExecutors() { ... } > ... > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
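The {noformat} declarations above lost their template parameters to HTML escaping. A minimal, self-contained sketch of what such a helper might look like — the type names, the per-framework map layout, and the boolean flag are assumptions for illustration, not the actual agent code:

{code}
// Hypothetical sketch only: type names and member layout are assumptions.
#include <list>
#include <string>
#include <unordered_map>

// Stand-in for mesos::ExecutorInfo plus the agent's bookkeeping.
struct ExecutorEntry
{
  std::string executorId;
  bool usesRevocableOptimisticResources;  // Assumed flag for illustration.
};

struct Slave
{
  // Executors known to the agent, keyed by framework ID, then executor ID.
  std::unordered_map<std::string,
      std::unordered_map<std::string, ExecutorEntry>> executors;

  // Returns the executors holding optimistically offered, revocable
  // resources; these are the eviction candidates.
  std::list<ExecutorEntry> getEvictableExecutors() const
  {
    std::list<ExecutorEntry> evictable;
    for (const auto& framework : executors) {
      for (const auto& entry : framework.second) {
        if (entry.second.usesRevocableOptimisticResources) {
          evictable.push_back(entry.second);
        }
      }
    }
    return evictable;
  }
};
{code}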
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037282#comment-15037282 ] Klaus Ma commented on MESOS-4049: - At which point would you like to report the new state to the framework? After the first missed ping, or should it be configurable, e.g. after some number of missed pings (< {{max_slave_ping_timeouts}})? > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037224#comment-15037224 ] Cong Wang commented on MESOS-4045: -- Oh... It passes on my Linux machine... I am trying to reproduce it on MacPro. > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: Mac OS X 10.11.1 >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037207#comment-15037207 ] Guangya Liu commented on MESOS-4049: [~neilc] Got it, thanks! Then I think we may need to consider the case where an UNKNOWN or WANDERING task gets killed. Shall we mark it as ZOMBIE and, when the host comes back, transition the ZOMBIE to TASK_FINISHED? > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037154#comment-15037154 ] Neil Conway commented on MESOS-4049: Yes. > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037152#comment-15037152 ] Klaus Ma commented on MESOS-4049: - I like the {{replacement task}} feature :). Just want to confirm: in this JIRA, Mesos only provides a new state for the connection glitch ({{UNKNOWN}} or {{WANDERING}}); the "replacement task" is handled by the framework. > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover
[ https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037144#comment-15037144 ] Klaus Ma commented on MESOS-4048: - My understanding is that {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} is used to detect a TCP disconnect in the steady state, and the master then waits {{slave_reregister_timeout}} for the slave to re-register. If the master has already received a TCP disconnect event, it should not need to keep pinging the slave for {{max_slave_ping_timeouts}} + {{slave_ping_timeout}}; the pings essentially simulate TCP keep-alive, which is not well supported on some OSes. > Consider unifying slave timeout behavior between steady state and master > failover > - > > Key: MESOS-4048 > URL: https://issues.apache.org/jira/browse/MESOS-4048 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere > > Currently, there are two timeouts that control what happens when an agent is > partitioned from the master: > 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the > master waits before declaring a slave to be dead in the "steady state" > 2. {{slave_reregister_timeout}} controls how long the master waits for a > slave to reregister after master failover. > It is unclear whether these two cases really merit being treated differently > -- it might be simpler for operators to configure a single timeout that > controls how long the master waits before declaring that a slave is dead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037116#comment-15037116 ] Neil Conway commented on MESOS-4049: I'm not sure {{ZOMBIE}} accurately describes the intended behavior -- for example, in Unix a zombie process cannot come back to life. A zombie process is definitely dead (it just hasn't been properly cleaned up), whereas in this case the true state of the task is not known (to the master/framework). > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037109#comment-15037109 ] Guangya Liu commented on MESOS-4049: It would be great to add such an intelligent feature. BTW, it might align better with the Linux concept if we name the transitional task state "ZOMBIE". > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4036) Install instructions for CentOS 6.6 lead to errors running `perf`
[ https://issues.apache.org/jira/browse/MESOS-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4036: - Summary: Install instructions for CentOS 6.6 lead to errors running `perf` (was: perf will not run on CentOS 6.6) > Install instructions for CentOS 6.6 lead to errors running `perf` > - > > Key: MESOS-4036 > URL: https://issues.apache.org/jira/browse/MESOS-4036 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere > > After using the current installation instructions in the getting started > documentation, {{perf}} will not run on CentOS 6.6 because the version of > elfutils included in devtoolset-2 is not compatible with the version of > {{perf}} installed by {{yum}}. Installing and using devtoolset-3, however > (http://linux.web.cern.ch/linux/scientific6/docs/softwarecollections.shtml) > fixes this issue. This could be resolved by updating the getting started > documentation to recommend installing devtoolset-3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4053: - Description: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems that mounted cgroups are not properly cleaned up after previous tests, so multiple hierarchies are detected and thus an error is produced: {code} [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} was: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. 
- ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems that mounted cgroups are not properly cleaned up after previous tests, > so multiple hierarchies are detected and thus an error is produced: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Ple
[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4053: - Description: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} was: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache [ OK ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache (2091 ms) ../../src/tests/containerizer/cgroups_tests.cpp:84: Failure (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted [--] 2 tests from CgroupsAnyHierarchyMemoryPressureTest (2109 ms total) [--] 2 tests from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.0011065 s, 948 MB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. 
- ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems the tests fail to correctly identify the base cgroups hierarchy: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests
[jira] [Created] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
Greg Mann created MESOS-4053: Summary: MemoryPressureMesosTest tests fail on CentOS 6.6 Key: MESOS-4053 URL: https://issues.apache.org/jira/browse/MESOS-4053 Project: Mesos Issue Type: Bug Environment: CentOS 6.6 Reporter: Greg Mann {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache [ OK ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache (2091 ms) ../../src/tests/containerizer/cgroups_tests.cpp:84: Failure (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted [--] 2 tests from CgroupsAnyHierarchyMemoryPressureTest (2109 ms total) [--] 2 tests from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.0011065 s, 948 MB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
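For anyone hitting this locally, a hedged sketch of the pre-test cleanup the error message asks for. It assumes the helpers declared in src/linux/cgroups.hpp — {{cgroups::hierarchies()}} returning the mounted hierarchy roots and {{cgroups::cleanup()}} (already visible in the log above) unmounting one — behave as their names suggest; verify the exact signatures against your tree before relying on this:

{code}
// Sketch only: tear down a leftover test hierarchy so that only one cgroups
// base hierarchy remains mounted before the tests run.
#include <set>
#include <string>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/try.hpp>

#include "linux/cgroups.hpp"

Try<Nothing> removeStaleTestHierarchy(const std::string& testHierarchy)
{
  Try<std::set<std::string>> hierarchies = cgroups::hierarchies();
  if (hierarchies.isError()) {
    return Error(hierarchies.error());
  }

  if (hierarchies.get().count(testHierarchy) > 0) {
    // Unmount and remove the stale hierarchy left behind by a previous run.
    return cgroups::cleanup(testHierarchy);
  }

  return Nothing();
}
{code}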
[jira] [Created] (MESOS-4052) Simple hook implementation proxying out to another daemon process
Zhitao Li created MESOS-4052: Summary: Simple hook implementation proxying out to another daemon process Key: MESOS-4052 URL: https://issues.apache.org/jira/browse/MESOS-4052 Project: Mesos Issue Type: Wish Components: modules Reporter: Zhitao Li Priority: Minor Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they have to maintain the compiling, building, and packaging of a dynamically linked C++ library in house. Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require users to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language. This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting. I'm more than happy to work on this rather than maintain this hook in house over the longer term. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4052) Simple hook implementation proxying out to another daemon process
[ https://issues.apache.org/jira/browse/MESOS-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-4052: - Description: Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they have to maintain the compiling, building, and packaging of a dynamically linked C++ library in house. Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require users to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language. This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting. Please let me know whether this seems like a reasonable feature/requirement. I'm more than happy to work on this rather than maintain this hook in house over the longer term. was: Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they have to maintain the compiling, building, and packaging of a dynamically linked C++ library in house. Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require users to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language. This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting. I'm more than happy to work on this rather than maintain this hook in house over the longer term. > Simple hook implementation proxying out to another daemon process > - > > Key: MESOS-4052 > URL: https://issues.apache.org/jira/browse/MESOS-4052 > Project: Mesos > Issue Type: Wish > Components: modules >Reporter: Zhitao Li >Priority: Minor > > Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they > have to maintain the compiling, building, and packaging of a dynamically > linked C++ library in house. > Designs like [Docker's Volume > plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require > users to implement a predefined REST API in any language and listen on a > domain socket. This would be more flexible for companies that do not use > C++ as their primary language. > This ticket explores whether Mesos could provide a default module that 1) > defines such an API and 2) proxies out to the external agent for any heavy > lifting. > Please let me know whether this seems like a reasonable > feature/requirement. > I'm more than happy to work on this rather than maintain this hook in house > over the longer term. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
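To make the proxying idea concrete, here is a rough sketch of the forwarding half, assuming the daemon listens on a hypothetical localhost port (a domain socket, as in the Docker plugin design, would need different plumbing) and using libprocess {{http::post}} plus {{JSON::protobuf}} to ship the TaskInfo across. The endpoint path and port are made up for illustration, and error handling is elided:

{code}
// Sketch only: forward a TaskInfo to an external daemon before launch.
// The daemon address and the "/hooks/pre-launch-docker" path are assumed.
#include <string>

#include <mesos/mesos.hpp>

#include <process/future.hpp>
#include <process/http.hpp>

#include <stout/json.hpp>
#include <stout/none.hpp>
#include <stout/protobuf.hpp>
#include <stout/stringify.hpp>

process::Future<process::http::Response> forwardToDaemon(
    const mesos::TaskInfo& task)
{
  // Serialize the protobuf to JSON so the daemon can be written in any
  // language, mirroring the Docker volume-plugin style of integration.
  const std::string body = stringify(JSON::protobuf(task));

  // Hypothetical local daemon endpoint.
  process::http::URL url("http", "127.0.0.1", 8082, "/hooks/pre-launch-docker");

  return process::http::post(url, None(), body, "application/json");
}
{code}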
[jira] [Created] (MESOS-4051) Support passing docker image env and user env to docker containerizer
Gilbert Song created MESOS-4051: --- Summary: Support passing docker image env and user env to docker containerizer Key: MESOS-4051 URL: https://issues.apache.org/jira/browse/MESOS-4051 Project: Mesos Issue Type: Improvement Components: containerization, docker Reporter: Gilbert Song Assignee: Gilbert Song Currently we only pass the slave env to the docker containerizer when it launches the executor container with docker run. We should also support passing the docker image env and the user's taskInfo env to the docker containerizer, with the following priority: 1. User taskInfo env (specified in commandInfo). 2. Docker image env. 3. Mesos slave env. We follow this priority when merging: if a variable is defined in more than one source, the higher-priority definition above overwrites the lower ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
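A small sketch of the merge order described above; the map-based representation and the function name are illustrative only, not the containerizer's actual data structures:

{code}
// Sketch only: merge environment variables with the priority
//   task (commandInfo) env  >  docker image env  >  mesos slave env.
#include <map>
#include <string>

using Environment = std::map<std::string, std::string>;

Environment mergeEnvironments(
    const Environment& taskEnv,   // 1. user taskInfo env (highest priority)
    const Environment& imageEnv,  // 2. docker image env
    const Environment& slaveEnv)  // 3. mesos slave env (lowest priority)
{
  Environment merged = taskEnv;

  // std::map::insert keeps the existing value on a key collision, which is
  // exactly the "higher priority wins" behavior described above.
  merged.insert(imageEnv.begin(), imageEnv.end());
  merged.insert(slaveEnv.begin(), slaveEnv.end());

  return merged;
}
{code}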
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036599#comment-15036599 ] Vinod Kone commented on MESOS-4049: --- +100 > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3977) http::_operation() creates unnecessary filter, rescinds unnecessarily
[ https://issues.apache.org/jira/browse/MESOS-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3977: --- Description: This function is used by the /reserve, /unreserve, /create-volumes, and /destroy-volumes endpoints. It has a few warts: 1. It installs a 5-second filter when rescinding an offer. However, the cluster state might change so that the filter is actually undesirable. For example, this scenario: * Create DR, make offer * Create PV => rescinds previous offer, sets filter, makes offer * Destroy PV => rescinds previous offer After the last step, we'll wait 5 seconds for the filter to expire before re-offering the DR. 2. If there are sufficient available resources at the target slave, we don't actually need to rescind any offers in the first place. However, _operation() rescinds offers unconditionally. was: This function is used by the /reserve, /unreserve, /create-volume, and /destroy-volume endpoints. It has a few warts: 1. It installs a 5-second filter when rescinding an offer. However, the cluster state might change so that the filter is actually undesirable. For example, this scenario: * Create DR, make offer * Create PV => rescinds previous offer, sets filter, makes offer * Destroy PV => rescinds previous offer After the last step, we'll wait 5 seconds for the filter to expire before re-offering the DR. 2. If there are sufficient available resources at the target slave, we don't actually need to rescind any offers in the first place. However, _operation() rescinds offers unconditionally. > http::_operation() creates unnecessary filter, rescinds unnecessarily > - > > Key: MESOS-3977 > URL: https://issues.apache.org/jira/browse/MESOS-3977 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere, reservations > > This function is used by the /reserve, /unreserve, /create-volumes, and > /destroy-volumes endpoints. It has a few warts: > 1. It installs a 5-second filter when rescinding an offer. However, the > cluster state might change so that the filter is actually undesirable. For > example, this scenario: > * Create DR, make offer > * Create PV => rescinds previous offer, sets filter, makes offer > * Destroy PV => rescinds previous offer > After the last step, we'll wait 5 seconds for the filter to expire before > re-offering the DR. > 2. If there are sufficient available resources at the target slave, we don't > actually need to rescind any offers in the first place. However, _operation() > rescinds offers unconditionally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3987) /create-volumes, /destroy-volumes should be permissive under a master without authentication.
[ https://issues.apache.org/jira/browse/MESOS-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3987: --- Summary: /create-volumes, /destroy-volumes should be permissive under a master without authentication. (was: /create-volume, /destroy-volume should be permissive under a master without authentication.) > /create-volumes, /destroy-volumes should be permissive under a master without > authentication. > - > > Key: MESOS-3987 > URL: https://issues.apache.org/jira/browse/MESOS-3987 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway > Labels: authentication, mesosphere, persistent-volumes > > See MESOS-3940 for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4050) Change task reconciliation to not omit unknown tasks
Neil Conway created MESOS-4050: -- Summary: Change task reconciliation to not omit unknown tasks Key: MESOS-4050 URL: https://issues.apache.org/jira/browse/MESOS-4050 Project: Mesos Issue Type: Improvement Components: framework, master Reporter: Neil Conway If a framework tries to reconcile the state of a task that is in an unknown state (because the agent running the task is partitioned from the master), the master will _not_ include any information about that task. This is confusing for framework authors. It seems better for the master to announce all the information it has explicitly: e.g., to return "task X is in an unknown state", rather than not returning anything. Then as more information arrives (e.g., task returns or task definitively dies), task state would transition appropriately. This might be consistent with changing the task states so that we capture "task is partitioned" as an explicit task state ({{TASK_UNKNOWN}} or {{TASK_WANDERING}}) -- see MESOS-4049. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
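For context, this is the explicit-reconciliation call a framework makes today — a minimal sketch against the C++ scheduler driver. The task ID is made up, and {{TASK_RUNNING}} is just a placeholder state required by the {{TaskStatus}} proto; with this change the master would answer even for tasks it currently considers unknown instead of staying silent:

{code}
// Sketch only: ask the master for the latest known status of one task.
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

void reconcileOneTask(mesos::SchedulerDriver* driver)
{
  mesos::TaskStatus status;
  status.mutable_task_id()->set_value("my-task-1");  // Hypothetical task ID.
  status.set_state(mesos::TASK_RUNNING);  // Placeholder; the proto requires a state.

  std::vector<mesos::TaskStatus> statuses;
  statuses.push_back(status);

  driver->reconcileTasks(statuses);  // Results arrive via Scheduler::statusUpdate().
}
{code}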
[jira] [Created] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
Neil Conway created MESOS-4049: -- Summary: Allow user to control behavior of partitioned agents/tasks Key: MESOS-4049 URL: https://issues.apache.org/jira/browse/MESOS-4049 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Neil Conway At present, if an agent is partitioned away from the master, the master waits for a period of time (see MESOS-4048) before deciding that the agent is dead. Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the tasks running on the agent, and instructs the agent to shutdown. Although this behavior is desirable for some/many users, it is not ideal for everyone. For example: * Some users might want to aggressively start a new replacement task (e.g., after one or two ping timeouts are missed); then when the old copy of the task comes back, they might want to make an intelligent decision about how to reconcile this situation (e.g., kill old, kill new, allow both to continue running). * Some frameworks might want different behavior from other frameworks, or to treat some tasks differently from other tasks. For example, if a task has a huge amount of state that would need to be regenerated to spin up another instance, the user might want to wait longer before starting a new task to increase the chance that the old task will reappear. To do this, we'd need to change task state so that a task can go from {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from that state back to {{RUNNING}} (or perhaps we could keep the current "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} could also transition to {{LOST}}). The agent would also keep its old {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-4045: Environment: Mac OS X 10.11.1 > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: Mac OS X 10.11.1 >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover
Neil Conway created MESOS-4048: -- Summary: Consider unifying slave timeout behavior between steady state and master failover Key: MESOS-4048 URL: https://issues.apache.org/jira/browse/MESOS-4048 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Neil Conway Priority: Minor Currently, there are two timeouts that control what happens when an agent is partitioned from the master: 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the master waits before declaring a slave to be dead in the "steady state" 2. {{slave_reregister_timeout}} controls how long the master waits for a slave to reregister after master failover. It is unclear whether these two cases really merit being treated differently -- it might be simpler for operators to configure a single timeout that controls how long the master waits before declaring that a slave is dead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
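To make the comparison concrete, a small worked example using what I believe are the default flag values ({{slave_ping_timeout=15secs}}, {{max_slave_ping_timeouts=5}}, {{slave_reregister_timeout=10mins}} — these defaults are an assumption, check {{mesos-master --help}} for your build). The steady-state wait is the product of the first two flags:

{code}
// Sketch only: the two waits this ticket proposes to unify, computed from
// assumed default flag values (verify against your build).
#include <chrono>
#include <iostream>

int main()
{
  using namespace std::chrono;

  const seconds slave_ping_timeout{15};        // --slave_ping_timeout (assumed default)
  const int max_slave_ping_timeouts = 5;       // --max_slave_ping_timeouts (assumed default)
  const minutes slave_reregister_timeout{10};  // --slave_reregister_timeout (assumed default)

  // Steady state: the agent is declared dead after this many missed pings.
  const seconds steady_state_wait = slave_ping_timeout * max_slave_ping_timeouts;

  std::cout << "steady-state wait:  " << steady_state_wait.count() << "s" << std::endl    // 75s
            << "post-failover wait: "
            << duration_cast<seconds>(slave_reregister_timeout).count() << "s" << std::endl;  // 600s

  return 0;
}
{code}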
[jira] [Assigned] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu reassigned MESOS-4047: Assignee: Joseph Wu > MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky > --- > > Key: MESOS-4047 > URL: https://issues.apache.org/jira/browse/MESOS-4047 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 > Environment: Ubuntu 14, gcc 4.8.4 >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: flaky, flaky-test > > {code:title=Output from passed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:14.319327 5062 exec.cpp:134] Version: 0.27.0 > I1202 11:09:14.17 5079 exec.cpp:208] Executor registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Registered executor on ubuntu > Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > Forked command at 5085 > I1202 11:09:14.391739 5077 exec.cpp:254] Received reconnect request from > slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > I1202 11:09:14.398598 5082 exec.cpp:231] Executor re-registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Re-registered executor on ubuntu > Shutting down > Sending SIGTERM to process tree at pid 5085 > Killing the following process trees: > [ > -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done > \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp > ] > [ OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms) > {code} > {code:title=Output from failed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:15.509950 5109 exec.cpp:134] Version: 0.27.0 > I1202 11:09:15.568183 5123 exec.cpp:208] Executor registered on slave > 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > Registered executor on ubuntu > Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6 > Forked command at 5132 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > I1202 11:09:15.665498 5129 exec.cpp:254] Received reconnect request from > slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > I1202 11:09:15.670995 5123 exec.cpp:381] Executor asked to shutdown > Shutting down > Sending SIGTERM to process tree at pid 5132 > ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure > (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913 > *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are > using GNU date *** > {code} > Notice that in the failed test, the executor is asked to shutdown when it > tries to reconnect to the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036465#comment-15036465 ] Joseph Wu commented on MESOS-4047: -- Note: {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} has similar logic for restarting the agent, re-registering an executor, and [calling {{MesosContainerizer::usage}}|https://github.com/apache/mesos/blob/master/src/tests/slave_recovery_tests.cpp#L3267]. But this test is stable. The flaky test waits on: {code} Future _recover = FUTURE_DISPATCH(_, &Slave::_recover); Future slaveReregisteredMessage = FUTURE_PROTOBUF(SlaveReregisteredMessage(), _, _); {code} Whereas the stable test waits on: {code} // Set up so we can wait until the new slave updates the container's // resources (this occurs after the executor has re-registered). Future update = FUTURE_DISPATCH(_, &MesosContainerizerProcess::update); {code} > MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky > --- > > Key: MESOS-4047 > URL: https://issues.apache.org/jira/browse/MESOS-4047 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 > Environment: Ubuntu 14, gcc 4.8.4 >Reporter: Joseph Wu > Labels: flaky, flaky-test > > {code:title=Output from passed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:14.319327 5062 exec.cpp:134] Version: 0.27.0 > I1202 11:09:14.17 5079 exec.cpp:208] Executor registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Registered executor on ubuntu > Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > Forked command at 5085 > I1202 11:09:14.391739 5077 exec.cpp:254] Received reconnect request from > slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > I1202 11:09:14.398598 5082 exec.cpp:231] Executor re-registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Re-registered executor on ubuntu > Shutting down > Sending SIGTERM to process tree at pid 5085 > Killing the following process trees: > [ > -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done > \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp > ] > [ OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms) > {code} > {code:title=Output from failed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:15.509950 5109 exec.cpp:134] Version: 0.27.0 > I1202 11:09:15.568183 5123 exec.cpp:208] Executor registered on slave > 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > Registered executor on ubuntu > Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6 > Forked command at 5132 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > I1202 11:09:15.665498 5129 exec.cpp:254] Received reconnect request from > slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > I1202 11:09:15.670995 5123 exec.cpp:381] Executor asked to shutdown > Shutting down > Sending SIGTERM to process tree at pid 5132 > ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure > (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913 > *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are > using GNU date *** > {code} > Notice that in the failed 
test, the executor is asked to shutdown when it > tries to reconnect to the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-3586: - Description: I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and software configurations. After performing {{../configure}}, {{make}}, and {{make check}} some servers have completed successfully and others failed on the test {{[ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}. Is there something I should check in this test? {code} PERFORMED MAKE CHECK NODE-001 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 20151005-143735-2393768202-35106-27900-S0 Registered executor on svdidac038.techlabs.accenture.com Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0 Forked command at 38510 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' PERFORMED MAKE CHECK NODE-002 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0 I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 20151005-143857-2360213770-50427-26325-S0 Registered executor on svdidac039.techlabs.accenture.com Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' Forked command at 37028 ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure Expected: (usage.get().mem_medium_pressure_counter()) >= (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6 2015-10-05 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server refused to accept the client [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms) {code} was: I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and software configurations. After performing ../configure, make, and make check some servers have completed successfully and others failed on the test [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics. Is there something I should check in this test?
PERFORMED MAKE CHECK NODE-001 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 20151005-143735-2393768202-35106-27900-S0 Registered executor on svdidac038.techlabs.accenture.com Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0 Forked command at 38510 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' PERFORMED MAKE CHECK NODE-002 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0 I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 20151005-143857-2360213770-50427-26325-S0 Registered executor on svdidac039.techlabs.accenture.com Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' Forked command at 37028 ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure Expected: (usage.get().mem_medium_pressure_counter()) >= (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6 2015-10-05 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server refused to accept the client [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms) > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and > CGROUPS_ROOT_SlaveRecovery are flaky > > > Key: MESOS-3586 > URL: https://issues.apache.org/jira/browse/MESOS-3586 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.24.0, 0.26.0 > Environment: Ubuntu 14.04, 3.13.0-32 generic > Debian 8, gcc 4.9.2 >Reporter: Miguel Bernadin >Assignee: Joseph Wu > Labels: flaky, flaky-test > > I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and > software configurations. > After performing {{../configure}}, {{make}}, and {{make check}} some servers > have completed successfully and others failed on the test {{[ RUN ] > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}. > Is there something I should check in this test? > {code} > PERFORMED MAKE CHECK NODE-001 > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 > I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave > 20151005-143735-2393768202-35106-27900-S0 > Registered executor on svdidac038.techlabs.ac
[jira] [Created] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
Joseph Wu created MESOS-4047: Summary: MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky Key: MESOS-4047 URL: https://issues.apache.org/jira/browse/MESOS-4047 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.26.0 Environment: Ubuntu 14, gcc 4.8.4 Reporter: Joseph Wu {code:title=Output from passed test} [--] 1 test from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery I1202 11:09:14.319327 5062 exec.cpp:134] Version: 0.27.0 I1202 11:09:14.17 5079 exec.cpp:208] Executor registered on slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 Registered executor on ubuntu Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' Forked command at 5085 I1202 11:09:14.391739 5077 exec.cpp:254] Received reconnect request from slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 I1202 11:09:14.398598 5082 exec.cpp:231] Executor re-registered on slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 Re-registered executor on ubuntu Shutting down Sending SIGTERM to process tree at pid 5085 Killing the following process trees: [ -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp ] [ OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms) {code} {code:title=Output from failed test} [--] 1 test from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery I1202 11:09:15.509950 5109 exec.cpp:134] Version: 0.27.0 I1202 11:09:15.568183 5123 exec.cpp:208] Executor registered on slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 Registered executor on ubuntu Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6 Forked command at 5132 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' I1202 11:09:15.665498 5129 exec.cpp:254] Received reconnect request from slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 I1202 11:09:15.670995 5123 exec.cpp:381] Executor asked to shutdown Shutting down Sending SIGTERM to process tree at pid 5132 ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913 *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are using GNU date *** {code} Notice that in the failed test, the executor is asked to shutdown when it tries to reconnect to the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036382#comment-15036382 ] Jojy Varghese commented on MESOS-3787: -- [~adam-mesos] That is a great point. I would think that we would need access checks inside/after the expansion of the env variables. > As a developer, I'd like to be able to expand environment variables through > the Docker executor. > > > Key: MESOS-3787 > URL: https://issues.apache.org/jira/browse/MESOS-3787 > Project: Mesos > Issue Type: Wish >Reporter: John Garcia > Labels: mesosphere > Attachments: mesos.patch, test-example.json > > > We'd like to have expanded variables usable in [the json files used to create > a Marathon app, hence] the Task's CommandInfo, so that the executor is able > to detect the correct values at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036287#comment-15036287 ] Neil Conway commented on MESOS-4045: Repros for me on OSX 10.10 > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-4045: --- Component/s: stout > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4046) Enable `Env` specified in docker image can be returned from docker pull
Gilbert Song created MESOS-4046: --- Summary: Enable `Env` specified in docker image can be returned from docker pull Key: MESOS-4046 URL: https://issues.apache.org/jira/browse/MESOS-4046 Project: Mesos Issue Type: Improvement Components: docker Reporter: Gilbert Song Assignee: Gilbert Song Currently docker pull only returns an image structure, which only contains entrypoint info. We run docker inspect as a subprocess inside docker pull, and its output provides a lot of other useful information about a docker image. We should be able to support returning environment variable information from the image. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
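For reference, the {{Env}} array is already part of the image config that {{docker inspect}} reports, so the data is there to be passed through. A minimal illustration (image name and output are examples, not taken from this ticket):
{code}
docker inspect --format '{{json .Config.Env}}' busybox:latest
# ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"]
{code}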
[jira] [Updated] (MESOS-3183) Documentation images do not load
[ https://issues.apache.org/jira/browse/MESOS-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-3183: - Description: Any images which are referenced from the generated docs ({{docs/*.md}}) do not show up on the website. For example: * [Architecture|http://mesos.apache.org/documentation/latest/architecture/] * [External Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/] * [Fetcher Cache Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/] * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] * [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/] was: Any images which are referenced from the generated docs ({{docs/*.md}}) do not show up on the website. For example: * [External Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/] * [Fetcher Cache Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/] * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] * [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/] > Documentation images do not load > > > Key: MESOS-3183 > URL: https://issues.apache.org/jira/browse/MESOS-3183 > Project: Mesos > Issue Type: Documentation > Components: documentation >Affects Versions: 0.24.0 >Reporter: James Mulcahy >Priority: Minor > Labels: mesosphere > Attachments: rake.patch > > > Any images which are referenced from the generated docs ({{docs/*.md}}) do > not show up on the website. For example: > * [Architecture|http://mesos.apache.org/documentation/latest/architecture/] > * [External > Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/] > * [Fetcher Cache > Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/] > * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] > * > [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3493) benchmark for declining offers
[ https://issues.apache.org/jira/browse/MESOS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036178#comment-15036178 ] James Peach commented on MESOS-3493: thanks! > benchmark for declining offers > -- > > Key: MESOS-3493 > URL: https://issues.apache.org/jira/browse/MESOS-3493 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Labels: Mesosphere > > I wrote a benchmark that can be used to demonstrate the performance issues > addressed in MESOS-3052, MESOS-3051, MESOS-3157 and MESOS-3075. The benchmark > simulates a number of frameworks that start declining all offers once they > reach the limit of work they need to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036157#comment-15036157 ] Michael Park commented on MESOS-4045: - [~wangcong] Looks like this one was introduced in https://github.com/apache/mesos/commit/7745eea2a4f4dff6e12f1955baa996a1869af3dc ? > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4045) NumifyTest.HexNumberTest fails
Michael Park created MESOS-4045: --- Summary: NumifyTest.HexNumberTest fails Key: MESOS-4045 URL: https://issues.apache.org/jira/browse/MESOS-4045 Project: Mesos Issue Type: Bug Reporter: Michael Park {noformat} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from NumifyTest [ RUN ] NumifyTest.HexNumberTest ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: Failure Value of: numify("0x10.9").isError() Actual: false Expected: true [ FAILED ] NumifyTest.HexNumberTest (0 ms) [--] 1 test from NumifyTest (0 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (0 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] NumifyTest.HexNumberTest {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3493) benchmark for declining offers
[ https://issues.apache.org/jira/browse/MESOS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036132#comment-15036132 ] Joris Van Remoortere commented on MESOS-3493: - [~jamespeach]Let's add this to the sprint for early next week :-) > benchmark for declining offers > -- > > Key: MESOS-3493 > URL: https://issues.apache.org/jira/browse/MESOS-3493 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Labels: Mesosphere > > I wrote a benchmark that can be used to demonstrate the performance issues > addressed in MESOS-3052, MESOS-3051, MESOS-3157 and MESOS-3075. The benchmark > simulates a number of frameworks that start declining all offers once they > reach the limit of work they need to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3493) benchmark for declining offers
[ https://issues.apache.org/jira/browse/MESOS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3493: Labels: Mesosphere (was: ) > benchmark for declining offers > -- > > Key: MESOS-3493 > URL: https://issues.apache.org/jira/browse/MESOS-3493 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Labels: Mesosphere > > I wrote a benchmark that can be used to demonstrate the performance issues > addressed in MESOS-3052, MESOS-3051, MESOS-3157 and MESOS-3075. The benchmark > simulates a number of frameworks that start declining all offers once they > reach the limit of work they need to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4026) RegistryClientTest.SimpleRegistryPuller is flaky
[ https://issues.apache.org/jira/browse/MESOS-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036072#comment-15036072 ] Jojy Varghese commented on MESOS-4026: -- https://reviews.apache.org/r/40872/ https://reviews.apache.org/r/40873/ The above patches have been tested 600+ times without the test being failed. > RegistryClientTest.SimpleRegistryPuller is flaky > > > Key: MESOS-4026 > URL: https://issues.apache.org/jira/browse/MESOS-4026 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Jojy Varghese > Labels: containerizer, flaky-test, mesosphere > > From ASF CI: > https://builds.apache.org/job/Mesos/1289/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/console > {code} > [ RUN ] RegistryClientTest.SimpleRegistryPuller > I1127 02:51:40.235900 362 registry_client.cpp:511] Response status for url > 'https://localhost:57828/v2/library/busybox/manifests/latest': 401 > Unauthorized > I1127 02:51:40.249766 360 registry_client.cpp:511] Response status for url > 'https://localhost:57828/v2/library/busybox/manifests/latest': 200 OK > I1127 02:51:40.251137 361 registry_puller.cpp:195] Downloading layer > '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' for image > 'busybox:latest' > I1127 02:51:40.258514 354 registry_client.cpp:511] Response status for url > 'https://localhost:57828/v2/library/busybox/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4': > 307 Temporary Redirect > I1127 02:51:40.264171 367 libevent_ssl_socket.cpp:1023] Socket error: > Connection reset by peer > ../../src/tests/containerizer/provisioner_docker_tests.cpp:1210: Failure > (socket).failure(): Failed accept: connection error: Connection reset by peer > [ FAILED ] RegistryClientTest.SimpleRegistryPuller (349 ms) > {code} > Logs from a previous run that passed: > {code} > [ RUN ] RegistryClientTest.SimpleRegistryPuller > I1126 18:49:05.306396 349 registry_client.cpp:511] Response status for url > 'https://localhost:53492/v2/library/busybox/manifests/latest': 401 > Unauthorized > I1126 18:49:05.321362 347 registry_client.cpp:511] Response status for url > 'https://localhost:53492/v2/library/busybox/manifests/latest': 200 OK > I1126 18:49:05.322720 352 registry_puller.cpp:195] Downloading layer > '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' for image > 'busybox:latest' > I1126 18:49:05.331317 350 registry_client.cpp:511] Response status for url > 'https://localhost:53492/v2/library/busybox/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4': > 307 Temporary Redirect > I1126 18:49:05.370625 352 registry_client.cpp:511] Response status for url > 'https://127.0.0.1:53492/': 200 OK > I1126 18:49:05.372102 355 registry_puller.cpp:294] Untarring layer > '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' downloaded > from registry to directory 'output_dir' > [ OK ] RegistryClientTest.SimpleRegistryPuller (353 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3219) Slave recovery issues with Docker containerizer
[ https://issues.apache.org/jira/browse/MESOS-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036024#comment-15036024 ] Yong Tang commented on MESOS-3219: -- If your systems can assume that the container should always be running, then a (partial) workaround is to have a shell script that constantly restarts the mesos slave in a loop within the container (a sketch follows below). In this way, the shell script serves as the foreground process, so the container will not die. If the mesos slave process itself dies, then at least the shell script will restart it and it will recover correctly. That is obviously not a complete solution, but it may help in certain situations. > Slave recovery issues with Docker containerizer > --- > > Key: MESOS-3219 > URL: https://issues.apache.org/jira/browse/MESOS-3219 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Anderson >Assignee: Timothy Chen >Priority: Minor > > I'm working on setting up a Mesos environment with the > Docker containerizer and can't seem to get the recovery feature > working. I'm running CoreOS, so the slave processes themselves are > containerized. I have no issues running jobs without the recovery > features enabled, but all jobs fail to boot when I add the following > flags: > MESOS_DOCKER_KILL_ORPHANS=false > MESOS_DOCKER_MESOS_IMAGE=myrepo/my-slave-container > Inspecting the Docker images and their log output reveals that the > container invocation appears to be flawed - see this gist, which shows the > arguments as retrieved via `docker inspect` as well as the failed container's > log output: > https://gist.github.com/banjiewen/a2dc1784a82ed87edd6b > The containerizer is attempting to invoke an unquoted command via > `/bin/sh -c`, which, predictably, fails to pass the complete command. > This results in the error message shown in the second file in the > linked gist. > This is reproducible manually; quoting the arguments to `/bin/sh -c` > results in success (at least, it correctly receives the supplied > arguments). > The slave container itself is not logging anything of interest. > It's possible that my instance is configured incorrectly as well; the > documentation here is a bit vague and there aren't many examples on the web. > I'm running Mesos 0.23.0 installed via http://repos.mesosphere.io/ in an > Ubuntu 14.04 container. CoreOS is at the latest stable (717.3.0) which gives > a Docker version at about 1.6.2. > I'm happy to provide more details if necessary. Cheers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
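A minimal sketch of the entrypoint loop described in the comment above (the binary path and flags are illustrative, not taken from this ticket):
{code}
#!/bin/sh
# Keep mesos-slave as the container's foreground process and restart it
# whenever it exits, so the container itself stays alive.
while true; do
  /usr/sbin/mesos-slave \
    --master=zk://zookeeper:2181/mesos \
    --work_dir=/var/lib/mesos \
    --docker_mesos_image=myrepo/my-slave-container
  echo "mesos-slave exited with status $?, restarting in 5s" >&2
  sleep 5
done
{code}
Note that this only keeps the container alive; whether the restarted slave actually recovers its tasks still depends on checkpointing and the recovery flags discussed in the ticket.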
[jira] [Updated] (MESOS-4044) SlaveRecoveryTest/0.Reboot is flaky
[ https://issues.apache.org/jira/browse/MESOS-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-4044: -- Labels: cgroups flaky-test mesosphere (was: mesosphere) > SlaveRecoveryTest/0.Reboot is flaky > --- > > Key: MESOS-4044 > URL: https://issues.apache.org/jira/browse/MESOS-4044 > Project: Mesos > Issue Type: Bug > Components: slave > Environment: Debian 8 on VirtualBox > {{configure --enable-debug --enable-ssl --enable-libevent}} >Reporter: Alexander Rojas > Labels: cgroups, flaky-test, mesosphere > > Running the test program as: > {code} > sudo src/mesos-tests --gtest_filter="SlaveRecoveryTest/0.Reboot" > --gtest_repeat=100 --verbose --gtest_break_on_failure > {code} > ends up every time at some point with the failure: > {noformat} > [ RUN ] SlaveRecoveryTest/0.Reboot > I1202 15:18:00.036594 26328 leveldb.cpp:176] Opened db in 12.924775ms > I1202 15:18:00.037643 26328 leveldb.cpp:183] Compacted db in 980477ns > I1202 15:18:00.037693 26328 leveldb.cpp:198] Created db iterator in 15079ns > I1202 15:18:00.037706 26328 leveldb.cpp:204] Seeked to beginning of db in > 1356ns > I1202 15:18:00.037716 26328 leveldb.cpp:273] Iterated through 0 keys in the > db in 313ns > I1202 15:18:00.037753 26328 replica.cpp:780] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1202 15:18:00.038360 26346 recover.cpp:449] Starting replica recovery > I1202 15:18:00.040987 26346 master.cpp:367] Master > baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 (debian-vm.localdomain) started on > 127.0.1.1:33625 > I1202 15:18:00.040998 26346 master.cpp:369] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/xt1N2F/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/xt1N2F/master" > --zk_session_timeout="10secs" > I1202 15:18:00.041157 26346 master.cpp:414] Master only allowing > authenticated frameworks to register > I1202 15:18:00.041163 26346 master.cpp:419] Master only allowing > authenticated slaves to register > I1202 15:18:00.041168 26346 credentials.hpp:37] Loading credentials for > authentication from '/tmp/xt1N2F/credentials' > I1202 15:18:00.041410 26346 master.cpp:458] Using default 'crammd5' > authenticator > I1202 15:18:00.041524 26346 master.cpp:495] Authorization enabled > I1202 15:18:00.042917 26343 recover.cpp:475] Replica is in EMPTY status > I1202 15:18:00.043557 26343 master.cpp:1606] The newly elected leader is > master@127.0.1.1:33625 with id baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 > I1202 15:18:00.043577 26343 master.cpp:1619] Elected as the leading master! 
> I1202 15:18:00.043589 26343 master.cpp:1379] Recovering from registrar > I1202 15:18:00.043766 26343 registrar.cpp:309] Recovering registrar > I1202 15:18:00.044668 26344 replica.cpp:676] Replica in EMPTY status received > a broadcasted recover request from (21064)@127.0.1.1:33625 > I1202 15:18:00.045027 26349 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1202 15:18:00.045497 26349 recover.cpp:566] Updating replica status to > STARTING > I1202 15:18:00.055539 26349 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 9.859161ms > I1202 15:18:00.055599 26349 replica.cpp:323] Persisted replica status to > STARTING > I1202 15:18:00.055958 26346 recover.cpp:475] Replica is in STARTING status > I1202 15:18:00.057106 26342 replica.cpp:676] Replica in STARTING status > received a broadcasted recover request from (21065)@127.0.1.1:33625 > I1202 15:18:00.057462 26343 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1202 15:18:00.057886 26347 recover.cpp:566] Updating replica status to VOTING > I1202 15:18:00.058706 26345 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 634303ns > I1202 15:18:00.058724 26345 replica.cpp:323] Persisted replica status to > VOTING > I1202 15:18:00.058821 26345 recover.cpp:580] Successfully joined the Paxos > group > I1202 15:18:00.058980 26345 recover.cpp:464] Recover process terminated > I1202 15:18:00.059288 26348 log.cpp:661] Attempting to start the writer > I1202 15:18:00.0603
[jira] [Updated] (MESOS-4043) CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky
[ https://issues.apache.org/jira/browse/MESOS-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-4043: -- Labels: cgroups flaky-test (was: ) > CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky > - > > Key: MESOS-4043 > URL: https://issues.apache.org/jira/browse/MESOS-4043 > Project: Mesos > Issue Type: Bug > Components: isolation > Environment: Debian 8 (In virutal machine). > Build with: > {{configure --enable-ssl --enable-libevent --enable-debug}} >Reporter: Alexander Rojas >Assignee: Timothy Chen > Labels: cgroups, flaky-test > > Running the test with > {code} > sudo src/mesos-tests --gtest_repeat=100 --verbose --gtest_break_on_failur > {code} > yielded at least once: > {noformat} > [ RUN ] > CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess > I1202 14:59:40.530966 2564 cgroups.cpp:2429] Freezing cgroup > /sys/fs/cgroup/freezer/mesos_test > I1202 14:59:40.546022 2566 cgroups.cpp:1411] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos_test after 14.974976ms > I1202 14:59:40.560233 2566 cgroups.cpp:2447] Thawing cgroup > /sys/fs/cgroup/freezer/mesos_test > I1202 14:59:40.574983 2570 cgroups.cpp:1440] Successfullly thawed cgroup > /sys/fs/cgroup/freezer/mesos_test after 14.671104ms > ../../src/tests/containerizer/cgroups_tests.cpp:939: Failure > Value of: ::waitpid(pid, &status, 0) > Actual: 26319 > Expected: -1 > *** Aborted at 1449064780 (unix time) try "date -d @1449064780" if you are > using GNU date *** > PC: @ 0x14b07ae testing::UnitTest::AddTestPartResult() > *** SIGSEGV (@0x0) received by PID 2549 (TID 0x7f0017c287c0) from PID 0; > stack trace: *** > @ 0x7e9b866c os::Linux::chained_handler() > @ 0x7e9bca0a JVM_handle_linux_signal > @ 0x7f00115458d0 (unknown) > @ 0x14b07ae testing::UnitTest::AddTestPartResult() > @ 0x14a51e7 testing::internal::AssertHelper::operator=() > @ 0x14129f4 > mesos::internal::tests::CgroupsAnyHierarchyWithFreezerTest_ROOT_CGROUPS_DestroyTracedProcess_Test::TestBody() > @ 0x14ce2d0 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x14c9248 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x14aa587 testing::Test::Run() > @ 0x14aad15 testing::TestInfo::Run() > @ 0x14ab350 testing::TestCase::Run() > @ 0x14b1c9f testing::internal::UnitTestImpl::RunAllTests() > @ 0x14cef5f > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x14c9d9e > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x14b09cf testing::UnitTest::Run() > @ 0xd63e02 RUN_ALL_TESTS() > @ 0xd639e0 main > @ 0x7f00111aeb45 (unknown) > @ 0x9588f9 (unknown) > {noformat} > However running: > {code} > sudo src/mesos-tests > --gtest_filter="CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess" > --gtest_repeat=1000 --verbose --gtest_break_on_failure > {code} > Doesn't reproduce the error. It may be cause by a state left by a previous > test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4044) SlaveRecoveryTest/0.Reboot is flaky
Alexander Rojas created MESOS-4044: -- Summary: SlaveRecoveryTest/0.Reboot is flaky Key: MESOS-4044 URL: https://issues.apache.org/jira/browse/MESOS-4044 Project: Mesos Issue Type: Bug Components: slave Environment: Debian 8 on VirtualBox {{configure --enable-debug --enable-ssl --enable-libevent}} Reporter: Alexander Rojas Running the test program as: {code} sudo src/mesos-tests --gtest_filter="SlaveRecoveryTest/0.Reboot" --gtest_repeat=100 --verbose --gtest_break_on_failure {code} ends up every time at some point with the failure: {noformat} [ RUN ] SlaveRecoveryTest/0.Reboot I1202 15:18:00.036594 26328 leveldb.cpp:176] Opened db in 12.924775ms I1202 15:18:00.037643 26328 leveldb.cpp:183] Compacted db in 980477ns I1202 15:18:00.037693 26328 leveldb.cpp:198] Created db iterator in 15079ns I1202 15:18:00.037706 26328 leveldb.cpp:204] Seeked to beginning of db in 1356ns I1202 15:18:00.037716 26328 leveldb.cpp:273] Iterated through 0 keys in the db in 313ns I1202 15:18:00.037753 26328 replica.cpp:780] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I1202 15:18:00.038360 26346 recover.cpp:449] Starting replica recovery I1202 15:18:00.040987 26346 master.cpp:367] Master baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 (debian-vm.localdomain) started on 127.0.1.1:33625 I1202 15:18:00.040998 26346 master.cpp:369] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/xt1N2F/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/xt1N2F/master" --zk_session_timeout="10secs" I1202 15:18:00.041157 26346 master.cpp:414] Master only allowing authenticated frameworks to register I1202 15:18:00.041163 26346 master.cpp:419] Master only allowing authenticated slaves to register I1202 15:18:00.041168 26346 credentials.hpp:37] Loading credentials for authentication from '/tmp/xt1N2F/credentials' I1202 15:18:00.041410 26346 master.cpp:458] Using default 'crammd5' authenticator I1202 15:18:00.041524 26346 master.cpp:495] Authorization enabled I1202 15:18:00.042917 26343 recover.cpp:475] Replica is in EMPTY status I1202 15:18:00.043557 26343 master.cpp:1606] The newly elected leader is master@127.0.1.1:33625 with id baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 I1202 15:18:00.043577 26343 master.cpp:1619] Elected as the leading master! 
I1202 15:18:00.043589 26343 master.cpp:1379] Recovering from registrar I1202 15:18:00.043766 26343 registrar.cpp:309] Recovering registrar I1202 15:18:00.044668 26344 replica.cpp:676] Replica in EMPTY status received a broadcasted recover request from (21064)@127.0.1.1:33625 I1202 15:18:00.045027 26349 recover.cpp:195] Received a recover response from a replica in EMPTY status I1202 15:18:00.045497 26349 recover.cpp:566] Updating replica status to STARTING I1202 15:18:00.055539 26349 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 9.859161ms I1202 15:18:00.055599 26349 replica.cpp:323] Persisted replica status to STARTING I1202 15:18:00.055958 26346 recover.cpp:475] Replica is in STARTING status I1202 15:18:00.057106 26342 replica.cpp:676] Replica in STARTING status received a broadcasted recover request from (21065)@127.0.1.1:33625 I1202 15:18:00.057462 26343 recover.cpp:195] Received a recover response from a replica in STARTING status I1202 15:18:00.057886 26347 recover.cpp:566] Updating replica status to VOTING I1202 15:18:00.058706 26345 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 634303ns I1202 15:18:00.058724 26345 replica.cpp:323] Persisted replica status to VOTING I1202 15:18:00.058821 26345 recover.cpp:580] Successfully joined the Paxos group I1202 15:18:00.058980 26345 recover.cpp:464] Recover process terminated I1202 15:18:00.059288 26348 log.cpp:661] Attempting to start the writer I1202 15:18:00.060330 26342 replica.cpp:496] Replica received implicit promise request from (21066)@127.0.1.1:33625 with proposal 1 I1202 15:18:00.061751 26342 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.395961ms I1202 15:18:00.061774 26342 replica.cpp:345] Persisted promised to 1 I1202 15:18:00.062237 26342 coordinator.cpp:240] Coordinator attempting to fill missing positions I1202 15:18:00.063148 26342 repli
[jira] [Created] (MESOS-4043) CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky
Alexander Rojas created MESOS-4043: -- Summary: CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky Key: MESOS-4043 URL: https://issues.apache.org/jira/browse/MESOS-4043 Project: Mesos Issue Type: Bug Components: isolation Environment: Debian 8 (In virutal machine). Build with: {{configure --enable-ssl --enable-libevent --enable-debug}} Reporter: Alexander Rojas Assignee: Timothy Chen Running the test with {code} sudo src/mesos-tests --gtest_repeat=100 --verbose --gtest_break_on_failur {code} yielded at least once: {noformat} [ RUN ] CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess I1202 14:59:40.530966 2564 cgroups.cpp:2429] Freezing cgroup /sys/fs/cgroup/freezer/mesos_test I1202 14:59:40.546022 2566 cgroups.cpp:1411] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos_test after 14.974976ms I1202 14:59:40.560233 2566 cgroups.cpp:2447] Thawing cgroup /sys/fs/cgroup/freezer/mesos_test I1202 14:59:40.574983 2570 cgroups.cpp:1440] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos_test after 14.671104ms ../../src/tests/containerizer/cgroups_tests.cpp:939: Failure Value of: ::waitpid(pid, &status, 0) Actual: 26319 Expected: -1 *** Aborted at 1449064780 (unix time) try "date -d @1449064780" if you are using GNU date *** PC: @ 0x14b07ae testing::UnitTest::AddTestPartResult() *** SIGSEGV (@0x0) received by PID 2549 (TID 0x7f0017c287c0) from PID 0; stack trace: *** @ 0x7e9b866c os::Linux::chained_handler() @ 0x7e9bca0a JVM_handle_linux_signal @ 0x7f00115458d0 (unknown) @ 0x14b07ae testing::UnitTest::AddTestPartResult() @ 0x14a51e7 testing::internal::AssertHelper::operator=() @ 0x14129f4 mesos::internal::tests::CgroupsAnyHierarchyWithFreezerTest_ROOT_CGROUPS_DestroyTracedProcess_Test::TestBody() @ 0x14ce2d0 testing::internal::HandleSehExceptionsInMethodIfSupported<>() @ 0x14c9248 testing::internal::HandleExceptionsInMethodIfSupported<>() @ 0x14aa587 testing::Test::Run() @ 0x14aad15 testing::TestInfo::Run() @ 0x14ab350 testing::TestCase::Run() @ 0x14b1c9f testing::internal::UnitTestImpl::RunAllTests() @ 0x14cef5f testing::internal::HandleSehExceptionsInMethodIfSupported<>() @ 0x14c9d9e testing::internal::HandleExceptionsInMethodIfSupported<>() @ 0x14b09cf testing::UnitTest::Run() @ 0xd63e02 RUN_ALL_TESTS() @ 0xd639e0 main @ 0x7f00111aeb45 (unknown) @ 0x9588f9 (unknown) {noformat} However running: {code} sudo src/mesos-tests --gtest_filter="CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess" --gtest_repeat=1000 --verbose --gtest_break_on_failure {code} Doesn't reproduce the error. It may be cause by a state left by a previous test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4041) Default command executor do not have executor_id
[ https://issues.apache.org/jira/browse/MESOS-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035836#comment-15035836 ] Klaus Ma commented on MESOS-4041: - [~gyliu], for command line, Master does not know executor information; the executor info is build in Mesos Slave. I'm working on MESOS-1718 to re-build command executor info in Mesos Master; so 1. no resources overcommited, I'm going to cut some resources of command tasks to executor 2. Master has the executor info for Optimistic Offer Phase 1 > Default command executor do not have executor_id > > > Key: MESOS-4041 > URL: https://issues.apache.org/jira/browse/MESOS-4041 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > I was doing some test with Marathon on top of mesos and found that when using > mesos command executor, the executor_id is always empty. > {code} > "state": "TASK_RUNNING", > "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", > "resources": { > "ports": "[33505-33505]", > "mem": 16, > "disk": 0, > "cpus": 0.1 > }, > "name": "t1", > "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", > "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", > "executor_id": "" > } > {code} > When I test with mesos-executor with command line executor, also no executor > id. > {code} > { > "unregistered_frameworks": [], > "frameworks": [ > { > "webui_url": "", > "user": "root", > "used_resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "unregistered_time": 0, > "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "hostname": "devstack007.cn.ibm.com", > "failover_timeout": 0, > "executors": [], > "completed_tasks": [], > "checkpoint": false, > "capabilities": [], > "active": true, > "name": "", > "offered_resources": { > "mem": 0, > "disk": 0, > "cpus": 0 > }, > "offers": [], > "pid": > "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", > "registered_time": 1445236263.95058, > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "role": "*", > "tasks": [ > { > "statuses": [ > { > "timestamp": 1445236266.63443, > "state": "TASK_RUNNING", > "container_status": { > "network_infos": [ > { > "ip_address": "9.111.242.187" > } > ] > } > } > ], > "state": "TASK_RUNNING", > "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "name": "cluster-test", > "id": "cluster-test", > "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "executor_id": "" > } > ] > } > ], > "completed_frameworks": [] > } > {code} > This caused the end use can not use http end point to kill some executors as > mesos require executor id when kill executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4042) Complete LevelDBStateTest suite fails in optimized build
[ https://issues.apache.org/jira/browse/MESOS-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4042: Description: Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared folder fails with {code} [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch ../../src/tests/state_tests.cpp:90: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch ../../src/tests/state_tests.cpp:120: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch ../../src/tests/state_tests.cpp:156: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch ../../src/tests/state_tests.cpp:198: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge ../../src/tests/state_tests.cpp:233: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch ../../src/tests/state_tests.cpp:264: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms) [ RUN ] LevelDBStateTest.Names ../../src/tests/state_tests.cpp:304: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.Names (10 ms) {code} The identical error occurs for a non-optimized build. 
was: Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared folder fails with {code} [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch ../../src/tests/state_tests.cpp:90: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch ../../src/tests/state_tests.cpp:120: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch ../../src/tests/state_tests.cpp:156: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch ../../src/tests/state_tests.cpp:198: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge ../../src/tests/state_tests.cpp:233: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch ../../src/tests/state_tests.cpp:264: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms) [ RUN ] LevelDBStateTest.Names ../../src/tests/state_tests.cpp:304: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.Names (10 ms) {code} At least for me not optimized builds seem unaffected. > Complete LevelDBStateTest suite fails in optimized build > > > Key: MESOS-4042 > URL: https://issues.apache.org/jira/browse/MESOS-4042 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Bannier > > Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a > ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared > folder fails with > {code} > [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch > ../../src/tests/state_tests.cpp:90: Failure > (fut
[jira] [Created] (MESOS-4042) Complete LevelDBStateTest suite fails in optimized build
Benjamin Bannier created MESOS-4042: --- Summary: Complete LevelDBStateTest suite fails in optimized build Key: MESOS-4042 URL: https://issues.apache.org/jira/browse/MESOS-4042 Project: Mesos Issue Type: Bug Reporter: Benjamin Bannier Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared folder fails with {code} [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch ../../src/tests/state_tests.cpp:90: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch ../../src/tests/state_tests.cpp:120: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch ../../src/tests/state_tests.cpp:156: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch ../../src/tests/state_tests.cpp:198: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge ../../src/tests/state_tests.cpp:233: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch ../../src/tests/state_tests.cpp:264: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms) [ RUN ] LevelDBStateTest.Names ../../src/tests/state_tests.cpp:304: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.Names (10 ms) {code} At least for me not optimized builds seem unaffected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
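If the vboxsf shared folder is the culprit here (LevelDB is known to be sensitive to filesystems that do not fully support its locking/mmap usage; that is an assumption, not something confirmed in this ticket), an out-of-source build on a native filesystem inside the VM should sidestep it, e.g.:
{code}
# Out-of-source build on the VM's native filesystem instead of /vagrant
# (paths are illustrative).
mkdir -p "$HOME/mesos-build" && cd "$HOME/mesos-build"
/vagrant/mesos/configure --enable-optimized
make check
{code}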
[jira] [Commented] (MESOS-3215) CgroupsAnyHierarchyWithPerfEventTest failing on Ubuntu 14.04
[ https://issues.apache.org/jira/browse/MESOS-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035689#comment-15035689 ] Bernd Mathiske commented on MESOS-3215: --- Do you have perf enabled on that machine? > CgroupsAnyHierarchyWithPerfEventTest failing on Ubuntu 14.04 > > > Key: MESOS-3215 > URL: https://issues.apache.org/jira/browse/MESOS-3215 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan > Labels: mesosphere > > [ RUN ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf > ../../src/tests/containerizer/cgroups_tests.cpp:172: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy > ../../src/tests/containerizer/cgroups_tests.cpp:190: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf (9 ms) > [--] 1 test from CgroupsAnyHierarchyWithPerfEventTest (9 ms total) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
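A quick way to verify perf support on the test machine (commands are illustrative; the package providing {{perf}} varies by distro and kernel version):
{code}
perf --version
perf stat -e cycles -- sleep 1   # prints a counter summary if perf works
{code}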
[jira] [Comment Edited] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035684#comment-15035684 ] Bernd Mathiske edited comment on MESOS-4035 at 12/2/15 11:48 AM: - It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. And also auto-disabling the test (ideally prompting a message about why then): https://issues.apache.org/jira/browse/MESOS-3471 was (Author: bernd-mesos): It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a v
[jira] [Commented] (MESOS-4039) PerfEventIsolatorTest.ROOT_CGROUPS_Sample fails
[ https://issues.apache.org/jira/browse/MESOS-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035687#comment-15035687 ] Bernd Mathiske commented on MESOS-4039: --- It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. And also auto-disabling the test (ideally prompting a message about why then): https://issues.apache.org/jira/browse/MESOS-3471 > PerfEventIsolatorTest.ROOT_CGROUPS_Sample fails > --- > > Key: MESOS-4039 > URL: https://issues.apache.org/jira/browse/MESOS-4039 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann > Labels: mesosphere, test-fail > > PerfEventIsolatorTest.ROOT_CGROUPS_Sample fails on CentOS 6.6: > {code} > [--] 1 test from PerfEventIsolatorTest > [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample > ../../src/tests/containerizer/isolator_tests.cpp:848: Failure > isolator: Perf is not supported > [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (79 ms) > [--] 1 test from PerfEventIsolatorTest (79 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (86 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4038) SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035685#comment-15035685 ] Bernd Mathiske edited comment on MESOS-4038 at 12/2/15 11:48 AM: - It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. And also auto-disabling the test (ideally prompting a message about why then): https://issues.apache.org/jira/browse/MESOS-3471 was (Author: bernd-mesos): It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. > SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6 > -- > > Key: MESOS-4038 > URL: https://issues.apache.org/jira/browse/MESOS-4038 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > All {{SlaveRecoveryTest.\*}} tests, > {{MesosContainerizerSlaveRecoveryTest.\*}} tests, and > {{UserCgroupIsolatorTest*}} tests fail on CentOS 6.6 with {{TypeParam = > mesos::internal::slave::MesosContainerizer}}. They all fail with the same > error: > {code} > [--] 1 test from SlaveRecoveryTest/0, where TypeParam = > mesos::internal::slave::MesosContainerizer > [ RUN ] SlaveRecoveryTest/0.ReconnectExecutor > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/cgroup/perf_event' already exists in > the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-SlaveRecoveryTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/cgroup/perf_event' is not a valid hierarchy > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer (8 ms) > [--] 1 test from SlaveRecoveryTest/0 (9 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (15 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4038) SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035685#comment-15035685 ] Bernd Mathiske commented on MESOS-4038: --- It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. > SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6 > -- > > Key: MESOS-4038 > URL: https://issues.apache.org/jira/browse/MESOS-4038 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > All {{SlaveRecoveryTest.\*}} tests, > {{MesosContainerizerSlaveRecoveryTest.\*}} tests, and > {{UserCgroupIsolatorTest*}} tests fail on CentOS 6.6 with {{TypeParam = > mesos::internal::slave::MesosContainerizer}}. They all fail with the same > error: > {code} > [--] 1 test from SlaveRecoveryTest/0, where TypeParam = > mesos::internal::slave::MesosContainerizer > [ RUN ] SlaveRecoveryTest/0.ReconnectExecutor > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/cgroup/perf_event' already exists in > the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-SlaveRecoveryTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/cgroup/perf_event' is not a valid hierarchy > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer (8 ms) > [--] 1 test from SlaveRecoveryTest/0 (9 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (15 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
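Concretely, the two options suggested by the error text above might look like this (the path is taken from the log; unmounting assumes nothing else is using that hierarchy):
{code}
# Remove the pre-existing hierarchy so the tests can create their own:
sudo umount /cgroup/perf_event
sudo rmdir /cgroup/perf_event

# ...or skip the affected test cases:
sudo ./bin/mesos-tests.sh --gtest_filter="-SlaveRecoveryTest/0.*"
{code}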
[jira] [Updated] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-4035: -- Target Version/s: 0.27.0 > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess (2 ms) > [--] 1 test from UserCgroupIsolatorTest/2 (2 ms total) > [--] Global test environment tear-down > [==] 3 tests from 3 test cases ran. (349 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 3 tests, listed below: > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess > [ FAILED ] UserCgroupIsolat
[jira] [Commented] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035684#comment-15035684 ] Bernd Mathiske commented on MESOS-4035: --- It seems that the right "fix" is to add documentation about needing perf support, about what to expect on different kinds of VMs, and about how to set them up. > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess (2 ms) > [--] 1 test from UserCgroupIsolatorTest/2 (2 ms total) > [--] Global test environment tear-down > [==] 3 tests from 3 test cases ran. (349 ms total) > [ PASSED ] 0 tests. > [ FAILED
[jira] [Commented] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035656#comment-15035656 ] Jan Schlicht commented on MESOS-3586: - Thanks! Can confirm that this fixes the flakiness with CentOS 7.1 for me. I was running {{sudo ./bin/mesos-tests.sh --gtest_filter="MemoryPressureMesosTest.CGROUPS_ROOT_Statistics" --gtest_repeat=100 --gtest_break_on_failure}} > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and > CGROUPS_ROOT_SlaveRecovery are flaky > > > Key: MESOS-3586 > URL: https://issues.apache.org/jira/browse/MESOS-3586 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.24.0, 0.26.0 > Environment: Ubuntu 14.04, 3.13.0-32 generic > Debian 8, gcc 4.9.2 >Reporter: Miguel Bernadin >Assignee: Joseph Wu > Labels: flaky, flaky-test > > I am install Mesos 0.24.0 on 4 servers which have very similar hardware and > software configurations. > After performing ../configure, make, and make check some servers have > completed successfully and other failed on test [ RUN ] > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics. > Is there something I should check in this test? > PERFORMED MAKE CHECK NODE-001 > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 > I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave > 20151005-143735-2393768202-35106-27900-S0 > Registered executor on svdidac038.techlabs.accenture.com > Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0 > Forked command at 38510 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > PERFORMED MAKE CHECK NODE-002 > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0 > I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave > 20151005-143857-2360213770-50427-26325-S0 > Registered executor on svdidac039.techlabs.accenture.com > Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > Forked command at 37028 > ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure > Expected: (usage.get().mem_medium_pressure_counter()) >= > (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6 > 2015-10-05 > 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: > Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server > refused to accept the client > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4041) Default command executor do not have executor_id
[ https://issues.apache.org/jira/browse/MESOS-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035618#comment-15035618 ] Guangya Liu commented on MESOS-4041: The reason is that the framework does not set the executor id; this is true for both mesos-executor and marathon: https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/mesos/TaskBuilder.scala#L162 This will prevent optimistic offer phase 1 from working when there is no executor id, as the mesos master cannot determine which executor to evict. If we want marathon to also work with optimistic offer phase 1, then we may need to update marathon to set the executor info even with the command executor. > Default command executor do not have executor_id > > > Key: MESOS-4041 > URL: https://issues.apache.org/jira/browse/MESOS-4041 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > I was doing some test with Marathon on top of mesos and found that when using > mesos command executor, the executor_id is always empty. > {code} > "state": "TASK_RUNNING", > "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", > "resources": { > "ports": "[33505-33505]", > "mem": 16, > "disk": 0, > "cpus": 0.1 > }, > "name": "t1", > "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", > "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", > "executor_id": "" > } > {code} > When I test with mesos-executor with command line executor, also no executor > id. > {code} > { > "unregistered_frameworks": [], > "frameworks": [ > { > "webui_url": "", > "user": "root", > "used_resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "unregistered_time": 0, > "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "hostname": "devstack007.cn.ibm.com", > "failover_timeout": 0, > "executors": [], > "completed_tasks": [], > "checkpoint": false, > "capabilities": [], > "active": true, > "name": "", > "offered_resources": { > "mem": 0, > "disk": 0, > "cpus": 0 > }, > "offers": [], > "pid": > "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", > "registered_time": 1445236263.95058, > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "role": "*", > "tasks": [ > { > "statuses": [ > { > "timestamp": 1445236266.63443, > "state": "TASK_RUNNING", > "container_status": { > "network_infos": [ > { > "ip_address": "9.111.242.187" > } > ] > } > } > ], > "state": "TASK_RUNNING", > "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "name": "cluster-test", > "id": "cluster-test", > "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "executor_id": "" > } > ] > } > ], > "completed_frameworks": [] > } > {code} > This caused the end use can not use http end point to kill some executors as > mesos require executor id when kill executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4041) Default command executor do not have executor_id
[ https://issues.apache.org/jira/browse/MESOS-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035620#comment-15035620 ] Guangya Liu commented on MESOS-4041: [~jvanremoortere] [~kaysoky] any comments? Thanks. > Default command executor do not have executor_id > > > Key: MESOS-4041 > URL: https://issues.apache.org/jira/browse/MESOS-4041 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > I was doing some test with Marathon on top of mesos and found that when using > mesos command executor, the executor_id is always empty. > {code} > "state": "TASK_RUNNING", > "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", > "resources": { > "ports": "[33505-33505]", > "mem": 16, > "disk": 0, > "cpus": 0.1 > }, > "name": "t1", > "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", > "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", > "executor_id": "" > } > {code} > When I test with mesos-executor with command line executor, also no executor > id. > {code} > { > "unregistered_frameworks": [], > "frameworks": [ > { > "webui_url": "", > "user": "root", > "used_resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "unregistered_time": 0, > "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "hostname": "devstack007.cn.ibm.com", > "failover_timeout": 0, > "executors": [], > "completed_tasks": [], > "checkpoint": false, > "capabilities": [], > "active": true, > "name": "", > "offered_resources": { > "mem": 0, > "disk": 0, > "cpus": 0 > }, > "offers": [], > "pid": > "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", > "registered_time": 1445236263.95058, > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "role": "*", > "tasks": [ > { > "statuses": [ > { > "timestamp": 1445236266.63443, > "state": "TASK_RUNNING", > "container_status": { > "network_infos": [ > { > "ip_address": "9.111.242.187" > } > ] > } > } > ], > "state": "TASK_RUNNING", > "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "name": "cluster-test", > "id": "cluster-test", > "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "executor_id": "" > } > ] > } > ], > "completed_frameworks": [] > } > {code} > This caused the end use can not use http end point to kill some executors as > mesos require executor id when kill executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3861) Authenticate quota requests
[ https://issues.apache.org/jira/browse/MESOS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3861: --- Description: Quota requests need to be authenticated. This ticket will authenticate quota requests using credentials provided by the {{Authorization}} field of the HTTP request. This is similar to how authentication is implemented in {{Master::Http}}. was: Quota requests need to be authenticated. This ticket will authenticate quota requests using credentials provided by the `Authorization` field of the HTTP request. This is similar to how authentication is implemented in `Master::Http`. > Authenticate quota requests > --- > > Key: MESOS-3861 > URL: https://issues.apache.org/jira/browse/MESOS-3861 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, security > > Quota requests need to be authenticated. > This ticket will authenticate quota requests using credentials provided by > the {{Authorization}} field of the HTTP request. This is similar to how > authentication is implemented in {{Master::Http}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
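For illustration, once this lands a quota request would carry credentials in the HTTP {{Authorization}} header, the same way other authenticated master endpoints do. A minimal sketch with curl, where the {{/quota}} path, the principal/secret pair, and the request body are assumptions for illustration rather than the final API:
{noformat}
# '-u principal:secret' makes curl add a basic 'Authorization' header.
# The endpoint path and the JSON body below are hypothetical.
curl -i -u principal:secret \
     -X POST \
     -d '{"role": "dev", "guarantee": [{"name": "cpus", "type": "SCALAR", "scalar": {"value": 2}}]}' \
     http://master.example.com:5050/quota
{noformat}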
[jira] [Updated] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-4025: Sprint: (was: Mesosphere Sprint 23) Story Points: (was: 3) > SlaveRecoveryTest/0.GCExecutor is flaky. > > > Key: MESOS-4025 > URL: https://issues.apache.org/jira/browse/MESOS-4025 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 >Reporter: Till Toenshoff > Labels: flaky, flaky-test, test > > Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based > on 0.26.0-rc1. > Testsuite was run as root. > {noformat} > sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1 > {noformat} > {noformat} > [ RUN ] SlaveRecoveryTest/0.GCExecutor > I1130 16:49:16.336833 1032 exec.cpp:136] Version: 0.26.0 > I1130 16:49:16.345212 1049 exec.cpp:210] Executor registered on slave > dde9fd4e-b016-4a99-9081-b047e9df9afa-S0 > Registered executor on ubuntu14 > Starting task 22c63bba-cbf8-46fd-b23a-5409d69e4114 > sh -c 'sleep 1000' > Forked command at 1057 > ../../src/tests/mesos.cpp:779: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave': > Device or resource busy > *** Aborted at 1448902156 (unix time) try "date -d @1448902156" if you are > using GNU date *** > PC: @ 0x1443e9a testing::UnitTest::AddTestPartResult() > *** SIGSEGV (@0x0) received by PID 27364 (TID 0x7f1bfdd2b800) from PID 0; > stack trace: *** > @ 0x7f1be92b80b7 os::Linux::chained_handler() > @ 0x7f1be92bc219 JVM_handle_linux_signal > @ 0x7f1bf7bbc340 (unknown) > @ 0x1443e9a testing::UnitTest::AddTestPartResult() > @ 0x1438b99 testing::internal::AssertHelper::operator=() > @ 0xf0b3bb > mesos::internal::tests::ContainerizerTest<>::TearDown() > @ 0x1461882 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x145c6f8 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x143de4a testing::Test::Run() > @ 0x143e584 testing::TestInfo::Run() > @ 0x143ebca testing::TestCase::Run() > @ 0x1445312 testing::internal::UnitTestImpl::RunAllTests() > @ 0x14624a7 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x145d26e > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x14440ae testing::UnitTest::Run() > @ 0xd15cd4 RUN_ALL_TESTS() > @ 0xd158c1 main > @ 0x7f1bf7808ec5 (unknown) > @ 0x913009 (unknown) > {noformat} > My Vagrantfile generator; > {noformat} > #!/usr/bin/env bash > cat << EOF > Vagrantfile > # -*- mode: ruby -*-" > > # vi: set ft=ruby : > Vagrant.configure(2) do |config| > # Disable shared folder to prevent certain kernel module dependencies. 
> config.vm.synced_folder ".", "/vagrant", disabled: true > config.vm.box = "bento/ubuntu-14.04" > config.vm.hostname = "${PLATFORM_NAME}" > config.vm.provider "virtualbox" do |vb| > vb.memory = ${VAGRANT_MEM} > vb.cpus = ${VAGRANT_CPUS} > vb.customize ["modifyvm", :id, "--nictype1", "virtio"] > vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"] > vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"] > end > config.vm.provider "vmware_fusion" do |vb| > vb.memory = ${VAGRANT_MEM} > vb.cpus = ${VAGRANT_CPUS} > end > config.vm.provision "file", source: "../test.sh", destination: "~/test.sh" > config.vm.provision "shell", inline: <<-SHELL > sudo apt-get update > sudo apt-get -y install openjdk-7-jdk autoconf libtool > sudo apt-get -y install build-essential python-dev python-boto \ > libcurl4-nss-dev libsasl2-dev maven \ > libapr1-dev libsvn-dev libssl-dev libevent-dev > sudo apt-get -y install git > sudo wget -qO- https://get.docker.com/ | sh > SHELL > end > EOF > {noformat} > The problem is kicking in frequently in my tests - I'ld say > 10% but less > than 50%. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
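A debugging note on the teardown failure above: "Device or resource busy" when destroying a cgroup usually means processes are still attached to it. A small sketch for inspecting and cleaning up by hand on the affected machine, using the cgroup path from the log (only safe if the remaining processes really are leftover test children):
{noformat}
CGROUP=/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave

# List the processes still attached to the cgroup that could not be removed.
cat "$CGROUP/cgroup.procs"

# Kill the leftovers; after that the cgroup directory can be removed.
for pid in $(cat "$CGROUP/cgroup.procs"); do sudo kill -9 "$pid"; done
sudo rmdir "$CGROUP"
{noformat}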
[jira] [Created] (MESOS-4041) Default command executor do not have executor_id
Guangya Liu created MESOS-4041: -- Summary: Default command executor do not have executor_id Key: MESOS-4041 URL: https://issues.apache.org/jira/browse/MESOS-4041 Project: Mesos Issue Type: Bug Reporter: Guangya Liu I was doing some tests with Marathon on top of mesos and found that when using the mesos command executor, the executor_id is always empty. {code} "state": "TASK_RUNNING", "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", "resources": { "ports": "[33505-33505]", "mem": 16, "disk": 0, "cpus": 0.1 }, "name": "t1", "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", "executor_id": "" } {code} When I tested with mesos-executor using the command line executor, there was also no executor id. {code} { "unregistered_frameworks": [], "frameworks": [ { "webui_url": "", "user": "root", "used_resources": { "mem": 256, "disk": 0, "cpus": 1 }, "unregistered_time": 0, "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", "hostname": "devstack007.cn.ibm.com", "failover_timeout": 0, "executors": [], "completed_tasks": [], "checkpoint": false, "capabilities": [], "active": true, "name": "", "offered_resources": { "mem": 0, "disk": 0, "cpus": 0 }, "offers": [], "pid": "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", "registered_time": 1445236263.95058, "resources": { "mem": 256, "disk": 0, "cpus": 1 }, "role": "*", "tasks": [ { "statuses": [ { "timestamp": 1445236266.63443, "state": "TASK_RUNNING", "container_status": { "network_infos": [ { "ip_address": "9.111.242.187" } ] } } ], "state": "TASK_RUNNING", "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", "resources": { "mem": 256, "disk": 0, "cpus": 1 }, "name": "cluster-test", "id": "cluster-test", "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", "executor_id": "" } ] } ], "completed_frameworks": [] } {code} This means the end user cannot use the HTTP endpoint to kill some executors, as mesos requires an executor id when killing executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4040) Clang's address sanitizer reports heap-use-after-free in ExamplesTest.EventCallFramework
[ https://issues.apache.org/jira/browse/MESOS-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4040: Attachment: asan.log > Clang's address sanitizer reports heap-use-after-free in > ExamplesTest.EventCallFramework > > > Key: MESOS-4040 > URL: https://issues.apache.org/jira/browse/MESOS-4040 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.26.0 >Reporter: Benjamin Bannier > Attachments: asan.log > > > For a libevent- and ssl-enabled debug build under ubuntu14.04 with clang-3.6 > and {{CXXFLAGS=-fsanitize=address}}, the address sanitizer reports a > use-after-free from {{ExamplesTest.EventCallFramework}} (log attached). > If this is not a false positive, it could lead to all kinds of issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
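For anyone trying to reproduce this, a build configuration along the lines described in the ticket (clang-3.6, libevent, SSL, ASan) would look roughly like the following; the exact flag spelling is an assumption based on the description, not a recorded command:
{noformat}
# Configure a libevent/SSL-enabled build instrumented with AddressSanitizer.
../configure CC=clang-3.6 CXX=clang++-3.6 \
             CFLAGS="-g -fsanitize=address" CXXFLAGS="-g -fsanitize=address" \
             --enable-libevent --enable-ssl
make -j4

# Run only the affected test; the ASan report is printed when the
# use-after-free is triggered.
./bin/mesos-tests.sh --gtest_filter="ExamplesTest.EventCallFramework"
{noformat}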
[jira] [Created] (MESOS-4040) Clang's address sanitizer reports heap-use-after-free in ExamplesTest.EventCallFramework
Benjamin Bannier created MESOS-4040: --- Summary: Clang's address sanitizer reports heap-use-after-free in ExamplesTest.EventCallFramework Key: MESOS-4040 URL: https://issues.apache.org/jira/browse/MESOS-4040 Project: Mesos Issue Type: Bug Affects Versions: 0.26.0 Reporter: Benjamin Bannier Attachments: asan.log For a libevent- and ssl-enabled debug build under ubuntu14.04 with clang-3.6 and {{CXXFLAGS=-fsanitize=address}}, the address sanitizer reports a use-after-free from {{ExamplesTest.EventCallFramework}} (log attached). If this is not a false positive, it could lead to all kinds of issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)