[jira] [Commented] (MESOS-3892) Add a helper function to the Agent to retrieve the list of executors that are using optimistically offered, revocable resources.
[ https://issues.apache.org/jira/browse/MESOS-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037416#comment-15037416 ] Guangya Liu commented on MESOS-3892: I know we have already had a lot of discussion about whether the master can kill tasks or not. Some frameworks may not implement {{terminateTask()}}, so once the executor is killed the task becomes TASK_FAILED and all of its resources are recovered even though the task keeps running; this leaves the host overcommitted. Kubernetes on Mesos is such a case, and I filed a ticket to track the k8s-on-Mesos issue: https://github.com/kubernetes/kubernetes/issues/18066 Having the master kill the task directly may not guarantee QoS, but it does keep resource accounting correct when an executor does not implement the {{terminateTask()}} API. Would it be possible to add a new field in Framework to control whether to kill the task or the executor when that framework's tasks are preempted? > Add a helper function to the Agent to retrieve the list of executors that are > using optimistically offered, revocable resources. > > > Key: MESOS-3892 > URL: https://issues.apache.org/jira/browse/MESOS-3892 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan >Assignee: Klaus Ma > Labels: mesosphere > > {noformat} > class Slave { > ... > // How the master currently keeps track of executors. > hashmap> executors; > ... > // Returns the list of executors that are using optimistically- > // offered, revocable resources. > list getEvictableExecutors() { ... } > ... > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
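The {noformat} declarations above lost their template parameters to HTML escaping. A minimal, self-contained sketch of what such a helper might look like — the type names, the per-framework map layout, and the boolean flag are assumptions for illustration, not the actual agent code:

{code}
// Hypothetical sketch only: type names and member layout are assumptions.
#include <list>
#include <string>
#include <unordered_map>

// Stand-in for mesos::ExecutorInfo plus the agent's bookkeeping.
struct ExecutorEntry
{
  std::string executorId;
  bool usesRevocableOptimisticResources;  // Assumed flag for illustration.
};

struct Slave
{
  // Executors known to the agent, keyed by framework ID, then executor ID.
  std::unordered_map<std::string,
      std::unordered_map<std::string, ExecutorEntry>> executors;

  // Returns the executors holding optimistically offered, revocable
  // resources; these are the eviction candidates.
  std::list<ExecutorEntry> getEvictableExecutors() const
  {
    std::list<ExecutorEntry> evictable;
    for (const auto& framework : executors) {
      for (const auto& entry : framework.second) {
        if (entry.second.usesRevocableOptimisticResources) {
          evictable.push_back(entry.second);
        }
      }
    }
    return evictable;
  }
};
{code}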
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037282#comment-15037282 ] Klaus Ma commented on MESOS-4049: - At which point would you like to report the new state to the framework? After the first missed ping, or should it be configurable, e.g. after some number of missed pings (< {{max_slave_ping_timeouts}})? > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037224#comment-15037224 ] Cong Wang commented on MESOS-4045: -- Oh... It passes on my Linux machine... I am trying to reproduce it on MacPro. > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: Mac OS X 10.11.1 >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037207#comment-15037207 ] Guangya Liu commented on MESOS-4049: [~neilc] Got it, thanks! Then I think we may need to consider the case where an UNKNOWN or WANDERING task gets killed. Shall we mark it as ZOMBIE and, when the host comes back, transition the ZOMBIE to TASK_FINISHED? > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037154#comment-15037154 ] Neil Conway commented on MESOS-4049: Yes. > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037152#comment-15037152 ] Klaus Ma commented on MESOS-4049: - I like the {{replacement task}} feature :). Just want to confirm: in this JIRA, Mesos only provides a new state for the connection glitch ({{UNKNOWN}} or {{WANDERING}}); the "replacement task" is handled by the framework. > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover
[ https://issues.apache.org/jira/browse/MESOS-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037144#comment-15037144 ] Klaus Ma commented on MESOS-4048: - My understanding is that {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} is used to detect a TCP disconnect in the steady state, and the master then waits {{slave_reregister_timeout}} for the slave to re-register. If the master has already received a TCP disconnect event, it should not need to keep pinging the slave for {{max_slave_ping_timeouts}} + {{slave_ping_timeout}}; the pings essentially simulate TCP keep-alive, which is not well supported on some OSes. > Consider unifying slave timeout behavior between steady state and master > failover > - > > Key: MESOS-4048 > URL: https://issues.apache.org/jira/browse/MESOS-4048 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere > > Currently, there are two timeouts that control what happens when an agent is > partitioned from the master: > 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the > master waits before declaring a slave to be dead in the "steady state" > 2. {{slave_reregister_timeout}} controls how long the master waits for a > slave to reregister after master failover. > It is unclear whether these two cases really merit being treated differently > -- it might be simpler for operators to configure a single timeout that > controls how long the master waits before declaring that a slave is dead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037116#comment-15037116 ] Neil Conway commented on MESOS-4049: I'm not sure {{ZOMBIE}} accurately describes the intended behavior -- for example, in Unix a zombie process cannot come back to life. A zombie process is definitely dead (it just hasn't been properly cleaned up), whereas in this case the true state of the task is not known (to the master/framework). > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037109#comment-15037109 ] Guangya Liu commented on MESOS-4049: It would be great to add such an intelligent feature. BTW, it might align better with the Linux concept if we name the transitional task state "ZOMBIE". > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4036) Install instructions for CentOS 6.6 lead to errors running `perf`
[ https://issues.apache.org/jira/browse/MESOS-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4036: - Summary: Install instructions for CentOS 6.6 lead to errors running `perf` (was: perf will not run on CentOS 6.6) > Install instructions for CentOS 6.6 lead to errors running `perf` > - > > Key: MESOS-4036 > URL: https://issues.apache.org/jira/browse/MESOS-4036 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere > > After using the current installation instructions in the getting started > documentation, {{perf}} will not run on CentOS 6.6 because the version of > elfutils included in devtoolset-2 is not compatible with the version of > {{perf}} installed by {{yum}}. Installing and using devtoolset-3, however > (http://linux.web.cern.ch/linux/scientific6/docs/softwarecollections.shtml) > fixes this issue. This could be resolved by updating the getting started > documentation to recommend installing devtoolset-3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4053: - Description: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems that mounted cgroups are not properly cleaned up after previous tests, so multiple hierarchies are detected and thus an error is produced: {code} [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} was: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. 
- ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems that mounted cgroups are not properly cleaned up after previous tests, > so multiple hierarchies are detected and thus an error is produced: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests/mesos.cpp:849: Failure > Value of: _baseHierarchy.get() > Actual: "/cgroup" > Expected: baseHierarchy > Which is: "/tmp/mesos_test_cgroup" > - > Multiple cgroups base hierarchies detected: > '/tmp/mesos_test_cgroup' > '/cgroup' > Mesos does not support multiple cgroups base hierarchies. > Ple
[jira] [Updated] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-4053: - Description: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} was: {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache [ OK ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache (2091 ms) ../../src/tests/containerizer/cgroups_tests.cpp:84: Failure (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted [--] 2 tests from CgroupsAnyHierarchyMemoryPressureTest (2109 ms total) [--] 2 tests from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.0011065 s, 948 MB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. 
- ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} > MemoryPressureMesosTest tests fail on CentOS 6.6 > > > Key: MESOS-4053 > URL: https://issues.apache.org/jira/browse/MESOS-4053 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and > {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It > seems the tests fail to correctly identify the base cgroups hierarchy: > {code} > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > ../../src/tests
[jira] [Created] (MESOS-4053) MemoryPressureMesosTest tests fail on CentOS 6.6
Greg Mann created MESOS-4053: Summary: MemoryPressureMesosTest tests fail on CentOS 6.6 Key: MESOS-4053 URL: https://issues.apache.org/jira/browse/MESOS-4053 Project: Mesos Issue Type: Bug Environment: CentOS 6.6 Reporter: Greg Mann {{MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}} and {{MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery}} fail on CentOS 6.6. It seems the tests fail to correctly identify the base cgroups hierarchy: {code} [ RUN ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache [ OK ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreasePageCache (2091 ms) ../../src/tests/containerizer/cgroups_tests.cpp:84: Failure (cgroups::cleanup(TEST_CGROUPS_HIERARCHY)).failure(): Operation not permitted [--] 2 tests from CgroupsAnyHierarchyMemoryPressureTest (2109 ms total) [--] 2 tests from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.0011065 s, 948 MB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (12 ms) [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery ../../src/tests/mesos.cpp:849: Failure Value of: _baseHierarchy.get() Actual: "/cgroup" Expected: baseHierarchy Which is: "/tmp/mesos_test_cgroup" - Multiple cgroups base hierarchies detected: '/tmp/mesos_test_cgroup' '/cgroup' Mesos does not support multiple cgroups base hierarchies. Please unmount the corresponding (or all) subsystems. - ../../src/tests/mesos.cpp:932: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/tmp/mesos_test_cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (7 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
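For anyone hitting this locally, a hedged sketch of the pre-test cleanup the error message asks for. It assumes the helpers declared in src/linux/cgroups.hpp — {{cgroups::hierarchies()}} returning the mounted hierarchy roots and {{cgroups::cleanup()}} (already visible in the log above) unmounting one — behave as their names suggest; verify the exact signatures against your tree before relying on this:

{code}
// Sketch only: tear down a leftover test hierarchy so that only one cgroups
// base hierarchy remains mounted before the tests run.
#include <set>
#include <string>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/try.hpp>

#include "linux/cgroups.hpp"

Try<Nothing> removeStaleTestHierarchy(const std::string& testHierarchy)
{
  Try<std::set<std::string>> hierarchies = cgroups::hierarchies();
  if (hierarchies.isError()) {
    return Error(hierarchies.error());
  }

  if (hierarchies.get().count(testHierarchy) > 0) {
    // Unmount and remove the stale hierarchy left behind by a previous run.
    return cgroups::cleanup(testHierarchy);
  }

  return Nothing();
}
{code}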
[jira] [Created] (MESOS-4052) Simple hook implementation proxying out to another daemon process
Zhitao Li created MESOS-4052: Summary: Simple hook implementation proxying out to another daemon process Key: MESOS-4052 URL: https://issues.apache.org/jira/browse/MESOS-4052 Project: Mesos Issue Type: Wish Components: modules Reporter: Zhitao Li Priority: Minor Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they have to maintain the compiling, building, and packaging of a dynamically linked C++ library in house. Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require users to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language. This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting. I'm more than happy to work on this rather than maintain this hook in house over the longer term. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4052) Simple hook implementation proxying out to another daemon process
[ https://issues.apache.org/jira/browse/MESOS-4052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhitao Li updated MESOS-4052: - Description: Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they have to maintain the compiling, building, and packaging of a dynamically linked C++ library in house. Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require users to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language. This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting. Please let me know whether this seems like a reasonable feature/requirement. I'm more than happy to work on this rather than maintain this hook in house over the longer term. was: Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they have to maintain the compiling, building, and packaging of a dynamically linked C++ library in house. Designs like [Docker's Volume plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require users to implement a predefined REST API in any language and listen on a domain socket. This would be more flexible for companies that do not use C++ as their primary language. This ticket explores whether Mesos could provide a default module that 1) defines such an API and 2) proxies out to the external agent for any heavy lifting. I'm more than happy to work on this rather than maintain this hook in house over the longer term. > Simple hook implementation proxying out to another daemon process > - > > Key: MESOS-4052 > URL: https://issues.apache.org/jira/browse/MESOS-4052 > Project: Mesos > Issue Type: Wish > Components: modules >Reporter: Zhitao Li >Priority: Minor > > Right now, if a Mesos user needs hooks like slavePreLaunchDockerHook, they > have to maintain the compiling, building, and packaging of a dynamically > linked C++ library in house. > Designs like [Docker's Volume > plugin|https://docs.docker.com/engine/extend/plugins_volume/] simply require > users to implement a predefined REST API in any language and listen on a > domain socket. This would be more flexible for companies that do not use > C++ as their primary language. > This ticket explores whether Mesos could provide a default module that 1) > defines such an API and 2) proxies out to the external agent for any heavy > lifting. > Please let me know whether this seems like a reasonable > feature/requirement. > I'm more than happy to work on this rather than maintain this hook in house > over the longer term. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
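To make the proxying idea concrete, here is a rough sketch of the forwarding half, assuming the daemon listens on a hypothetical localhost port (a domain socket, as in the Docker plugin design, would need different plumbing) and using libprocess {{http::post}} plus {{JSON::protobuf}} to ship the TaskInfo across. The endpoint path and port are made up for illustration, and error handling is elided:

{code}
// Sketch only: forward a TaskInfo to an external daemon before launch.
// The daemon address and the "/hooks/pre-launch-docker" path are assumed.
#include <string>

#include <mesos/mesos.hpp>

#include <process/future.hpp>
#include <process/http.hpp>

#include <stout/json.hpp>
#include <stout/none.hpp>
#include <stout/protobuf.hpp>
#include <stout/stringify.hpp>

process::Future<process::http::Response> forwardToDaemon(
    const mesos::TaskInfo& task)
{
  // Serialize the protobuf to JSON so the daemon can be written in any
  // language, mirroring the Docker volume-plugin style of integration.
  const std::string body = stringify(JSON::protobuf(task));

  // Hypothetical local daemon endpoint.
  process::http::URL url("http", "127.0.0.1", 8082, "/hooks/pre-launch-docker");

  return process::http::post(url, None(), body, "application/json");
}
{code}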
[jira] [Created] (MESOS-4051) Support passing docker image env and user env to docker containerizer
Gilbert Song created MESOS-4051: --- Summary: Support passing docker image env and user env to docker containerizer Key: MESOS-4051 URL: https://issues.apache.org/jira/browse/MESOS-4051 Project: Mesos Issue Type: Improvement Components: containerization, docker Reporter: Gilbert Song Assignee: Gilbert Song Currently we only pass the slave env to the docker containerizer when it launches the executor container with docker run. We should also support passing the docker image env and the user's taskInfo env to the docker containerizer, with the following priority: 1. User taskInfo env (specified in commandInfo). 2. Docker image env. 3. Mesos slave env. We follow this priority when merging: if a variable is defined in more than one source, the higher-priority definition above overwrites the lower ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
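A small sketch of the merge order described above; the map-based representation and the function name are illustrative only, not the containerizer's actual data structures:

{code}
// Sketch only: merge environment variables with the priority
//   task (commandInfo) env  >  docker image env  >  mesos slave env.
#include <map>
#include <string>

using Environment = std::map<std::string, std::string>;

Environment mergeEnvironments(
    const Environment& taskEnv,   // 1. user taskInfo env (highest priority)
    const Environment& imageEnv,  // 2. docker image env
    const Environment& slaveEnv)  // 3. mesos slave env (lowest priority)
{
  Environment merged = taskEnv;

  // std::map::insert keeps the existing value on a key collision, which is
  // exactly the "higher priority wins" behavior described above.
  merged.insert(imageEnv.begin(), imageEnv.end());
  merged.insert(slaveEnv.begin(), slaveEnv.end());

  return merged;
}
{code}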
[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036599#comment-15036599 ] Vinod Kone commented on MESOS-4049: --- +100 > Allow user to control behavior of partitioned agents/tasks > -- > > Key: MESOS-4049 > URL: https://issues.apache.org/jira/browse/MESOS-4049 > Project: Mesos > Issue Type: Improvement > Components: master, slave >Reporter: Neil Conway > Labels: mesosphere > > At present, if an agent is partitioned away from the master, the master waits > for a period of time (see MESOS-4048) before deciding that the agent is dead. > Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the > tasks running on the agent, and instructs the agent to shutdown. > Although this behavior is desirable for some/many users, it is not ideal for > everyone. For example: > * Some users might want to aggressively start a new replacement task (e.g., > after one or two ping timeouts are missed); then when the old copy of the > task comes back, they might want to make an intelligent decision about how to > reconcile this situation (e.g., kill old, kill new, allow both to continue > running). > * Some frameworks might want different behavior from other frameworks, or to > treat some tasks differently from other tasks. For example, if a task has a > huge amount of state that would need to be regenerated to spin up another > instance, the user might want to wait longer before starting a new task to > increase the chance that the old task will reappear. > To do this, we'd need to change task state so that a task can go from > {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from > that state back to {{RUNNING}} (or perhaps we could keep the current > "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} > could also transition to {{LOST}}). The agent would also keep its old > {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3977) http::_operation() creates unnecessary filter, rescinds unnecessarily
[ https://issues.apache.org/jira/browse/MESOS-3977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3977: --- Description: This function is used by the /reserve, /unreserve, /create-volumes, and /destroy-volumes endpoints. It has a few warts: 1. It installs a 5-second filter when rescinding an offer. However, the cluster state might change so that the filter is actually undesirable. For example, this scenario: * Create DR, make offer * Create PV => rescinds previous offer, sets filter, makes offer * Destroy PV => rescinds previous offer After the last step, we'll wait 5 seconds for the filter to expire before re-offering the DR. 2. If there are sufficient available resources at the target slave, we don't actually need to rescind any offers in the first place. However, _operation() rescinds offers unconditionally. was: This function is used by the /reserve, /unreserve, /create-volume, and /destroy-volume endpoints. It has a few warts: 1. It installs a 5-second filter when rescinding an offer. However, the cluster state might change so that the filter is actually undesirable. For example, this scenario: * Create DR, make offer * Create PV => rescinds previous offer, sets filter, makes offer * Destroy PV => rescinds previous offer After the last step, we'll wait 5 seconds for the filter to expire before re-offering the DR. 2. If there are sufficient available resources at the target slave, we don't actually need to rescind any offers in the first place. However, _operation() rescinds offers unconditionally. > http::_operation() creates unnecessary filter, rescinds unnecessarily > - > > Key: MESOS-3977 > URL: https://issues.apache.org/jira/browse/MESOS-3977 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere, reservations > > This function is used by the /reserve, /unreserve, /create-volumes, and > /destroy-volumes endpoints. It has a few warts: > 1. It installs a 5-second filter when rescinding an offer. However, the > cluster state might change so that the filter is actually undesirable. For > example, this scenario: > * Create DR, make offer > * Create PV => rescinds previous offer, sets filter, makes offer > * Destroy PV => rescinds previous offer > After the last step, we'll wait 5 seconds for the filter to expire before > re-offering the DR. > 2. If there are sufficient available resources at the target slave, we don't > actually need to rescind any offers in the first place. However, _operation() > rescinds offers unconditionally. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3987) /create-volumes, /destroy-volumes should be permissive under a master without authentication.
[ https://issues.apache.org/jira/browse/MESOS-3987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3987: --- Summary: /create-volumes, /destroy-volumes should be permissive under a master without authentication. (was: /create-volume, /destroy-volume should be permissive under a master without authentication.) > /create-volumes, /destroy-volumes should be permissive under a master without > authentication. > - > > Key: MESOS-3987 > URL: https://issues.apache.org/jira/browse/MESOS-3987 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway > Labels: authentication, mesosphere, persistent-volumes > > See MESOS-3940 for details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4050) Change task reconciliation to not omit unknown tasks
Neil Conway created MESOS-4050: -- Summary: Change task reconciliation to not omit unknown tasks Key: MESOS-4050 URL: https://issues.apache.org/jira/browse/MESOS-4050 Project: Mesos Issue Type: Improvement Components: framework, master Reporter: Neil Conway If a framework tries to reconcile the state of a task that is in an unknown state (because the agent running the task is partitioned from the master), the master will _not_ include any information about that task. This is confusing for framework authors. It seems better for the master to announce all the information it has explicitly: e.g., to return "task X is in an unknown state", rather than not returning anything. Then as more information arrives (e.g., task returns or task definitively dies), task state would transition appropriately. This might be consistent with changing the task states so that we capture "task is partitioned" as an explicit task state ({{TASK_UNKNOWN}} or {{TASK_WANDERING}}) -- see MESOS-4049. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
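For context, this is the explicit-reconciliation call a framework makes today — a minimal sketch against the C++ scheduler driver. The task ID is made up, and {{TASK_RUNNING}} is just a placeholder state required by the {{TaskStatus}} proto; with this change the master would answer even for tasks it currently considers unknown instead of staying silent:

{code}
// Sketch only: ask the master for the latest known status of one task.
#include <vector>

#include <mesos/mesos.hpp>
#include <mesos/scheduler.hpp>

void reconcileOneTask(mesos::SchedulerDriver* driver)
{
  mesos::TaskStatus status;
  status.mutable_task_id()->set_value("my-task-1");  // Hypothetical task ID.
  status.set_state(mesos::TASK_RUNNING);  // Placeholder; the proto requires a state.

  std::vector<mesos::TaskStatus> statuses;
  statuses.push_back(status);

  driver->reconcileTasks(statuses);  // Results arrive via Scheduler::statusUpdate().
}
{code}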
[jira] [Created] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks
Neil Conway created MESOS-4049: -- Summary: Allow user to control behavior of partitioned agents/tasks Key: MESOS-4049 URL: https://issues.apache.org/jira/browse/MESOS-4049 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Neil Conway At present, if an agent is partitioned away from the master, the master waits for a period of time (see MESOS-4048) before deciding that the agent is dead. Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the tasks running on the agent, and instructs the agent to shutdown. Although this behavior is desirable for some/many users, it is not ideal for everyone. For example: * Some users might want to aggressively start a new replacement task (e.g., after one or two ping timeouts are missed); then when the old copy of the task comes back, they might want to make an intelligent decision about how to reconcile this situation (e.g., kill old, kill new, allow both to continue running). * Some frameworks might want different behavior from other frameworks, or to treat some tasks differently from other tasks. For example, if a task has a huge amount of state that would need to be regenerated to spin up another instance, the user might want to wait longer before starting a new task to increase the chance that the old task will reappear. To do this, we'd need to change task state so that a task can go from {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from that state back to {{RUNNING}} (or perhaps we could keep the current "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} could also transition to {{LOST}}). The agent would also keep its old {{slaveId}} when it reconnects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-4045: Environment: Mac OS X 10.11.1 > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: Mac OS X 10.11.1 >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4048) Consider unifying slave timeout behavior between steady state and master failover
Neil Conway created MESOS-4048: -- Summary: Consider unifying slave timeout behavior between steady state and master failover Key: MESOS-4048 URL: https://issues.apache.org/jira/browse/MESOS-4048 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Neil Conway Priority: Minor Currently, there are two timeouts that control what happens when an agent is partitioned from the master: 1. {{max_slave_ping_timeouts}} + {{slave_ping_timeout}} controls how long the master waits before declaring a slave to be dead in the "steady state" 2. {{slave_reregister_timeout}} controls how long the master waits for a slave to reregister after master failover. It is unclear whether these two cases really merit being treated differently -- it might be simpler for operators to configure a single timeout that controls how long the master waits before declaring that a slave is dead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
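To make the comparison concrete, a small worked example using what I believe are the default flag values ({{slave_ping_timeout=15secs}}, {{max_slave_ping_timeouts=5}}, {{slave_reregister_timeout=10mins}} — these defaults are an assumption, check {{mesos-master --help}} for your build). The steady-state wait is the product of the first two flags:

{code}
// Sketch only: the two waits this ticket proposes to unify, computed from
// assumed default flag values (verify against your build).
#include <chrono>
#include <iostream>

int main()
{
  using namespace std::chrono;

  const seconds slave_ping_timeout{15};        // --slave_ping_timeout (assumed default)
  const int max_slave_ping_timeouts = 5;       // --max_slave_ping_timeouts (assumed default)
  const minutes slave_reregister_timeout{10};  // --slave_reregister_timeout (assumed default)

  // Steady state: the agent is declared dead after this many missed pings.
  const seconds steady_state_wait = slave_ping_timeout * max_slave_ping_timeouts;

  std::cout << "steady-state wait:  " << steady_state_wait.count() << "s" << std::endl    // 75s
            << "post-failover wait: "
            << duration_cast<seconds>(slave_reregister_timeout).count() << "s" << std::endl;  // 600s

  return 0;
}
{code}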
[jira] [Assigned] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu reassigned MESOS-4047: Assignee: Joseph Wu > MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky > --- > > Key: MESOS-4047 > URL: https://issues.apache.org/jira/browse/MESOS-4047 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 > Environment: Ubuntu 14, gcc 4.8.4 >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: flaky, flaky-test > > {code:title=Output from passed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:14.319327 5062 exec.cpp:134] Version: 0.27.0 > I1202 11:09:14.17 5079 exec.cpp:208] Executor registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Registered executor on ubuntu > Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > Forked command at 5085 > I1202 11:09:14.391739 5077 exec.cpp:254] Received reconnect request from > slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > I1202 11:09:14.398598 5082 exec.cpp:231] Executor re-registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Re-registered executor on ubuntu > Shutting down > Sending SIGTERM to process tree at pid 5085 > Killing the following process trees: > [ > -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done > \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp > ] > [ OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms) > {code} > {code:title=Output from failed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:15.509950 5109 exec.cpp:134] Version: 0.27.0 > I1202 11:09:15.568183 5123 exec.cpp:208] Executor registered on slave > 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > Registered executor on ubuntu > Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6 > Forked command at 5132 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > I1202 11:09:15.665498 5129 exec.cpp:254] Received reconnect request from > slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > I1202 11:09:15.670995 5123 exec.cpp:381] Executor asked to shutdown > Shutting down > Sending SIGTERM to process tree at pid 5132 > ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure > (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913 > *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are > using GNU date *** > {code} > Notice that in the failed test, the executor is asked to shutdown when it > tries to reconnect to the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-4047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036465#comment-15036465 ] Joseph Wu commented on MESOS-4047: -- Note: {{MesosContainerizerSlaveRecoveryTest.ResourceStatistics}} has similar logic for restarting the agent, re-registering an executor, and [calling {{MesosContainerizer::usage}}|https://github.com/apache/mesos/blob/master/src/tests/slave_recovery_tests.cpp#L3267]. But this test is stable. The flaky test waits on: {code} Future _recover = FUTURE_DISPATCH(_, &Slave::_recover); Future slaveReregisteredMessage = FUTURE_PROTOBUF(SlaveReregisteredMessage(), _, _); {code} Whereas the stable test waits on: {code} // Set up so we can wait until the new slave updates the container's // resources (this occurs after the executor has re-registered). Future update = FUTURE_DISPATCH(_, &MesosContainerizerProcess::update); {code} > MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky > --- > > Key: MESOS-4047 > URL: https://issues.apache.org/jira/browse/MESOS-4047 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 > Environment: Ubuntu 14, gcc 4.8.4 >Reporter: Joseph Wu > Labels: flaky, flaky-test > > {code:title=Output from passed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:14.319327 5062 exec.cpp:134] Version: 0.27.0 > I1202 11:09:14.17 5079 exec.cpp:208] Executor registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Registered executor on ubuntu > Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > Forked command at 5085 > I1202 11:09:14.391739 5077 exec.cpp:254] Received reconnect request from > slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > I1202 11:09:14.398598 5082 exec.cpp:231] Executor re-registered on slave > bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 > Re-registered executor on ubuntu > Shutting down > Sending SIGTERM to process tree at pid 5085 > Killing the following process trees: > [ > -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done > \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp > ] > [ OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms) > {code} > {code:title=Output from failed test} > [--] 1 test from MemoryPressureMesosTest > 1+0 records in > 1+0 records out > 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery > I1202 11:09:15.509950 5109 exec.cpp:134] Version: 0.27.0 > I1202 11:09:15.568183 5123 exec.cpp:208] Executor registered on slave > 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > Registered executor on ubuntu > Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6 > Forked command at 5132 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > I1202 11:09:15.665498 5129 exec.cpp:254] Received reconnect request from > slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 > I1202 11:09:15.670995 5123 exec.cpp:381] Executor asked to shutdown > Shutting down > Sending SIGTERM to process tree at pid 5132 > ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure > (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913 > *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are > using GNU date *** > {code} > Notice that in the failed 
test, the executor is asked to shutdown when it > tries to reconnect to the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-3586: - Description: I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and software configurations. After performing {{../configure}}, {{make}}, and {{make check}} some servers have completed successfully and others failed on the test {{[ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}. Is there something I should check in this test? {code} PERFORMED MAKE CHECK NODE-001 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 20151005-143735-2393768202-35106-27900-S0 Registered executor on svdidac038.techlabs.accenture.com Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0 Forked command at 38510 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' PERFORMED MAKE CHECK NODE-002 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0 I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 20151005-143857-2360213770-50427-26325-S0 Registered executor on svdidac039.techlabs.accenture.com Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' Forked command at 37028 ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure Expected: (usage.get().mem_medium_pressure_counter()) >= (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6 2015-10-05 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server refused to accept the client [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms) {code} was: I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and software configurations. After performing ../configure, make, and make check some servers have completed successfully and others failed on the test [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics. Is there something I should check in this test?
PERFORMED MAKE CHECK NODE-001 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 20151005-143735-2393768202-35106-27900-S0 Registered executor on svdidac038.techlabs.accenture.com Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0 Forked command at 38510 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' PERFORMED MAKE CHECK NODE-002 [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0 I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 20151005-143857-2360213770-50427-26325-S0 Registered executor on svdidac039.techlabs.accenture.com Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' Forked command at 37028 ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure Expected: (usage.get().mem_medium_pressure_counter()) >= (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6 2015-10-05 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server refused to accept the client [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms) > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and > CGROUPS_ROOT_SlaveRecovery are flaky > > > Key: MESOS-3586 > URL: https://issues.apache.org/jira/browse/MESOS-3586 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.24.0, 0.26.0 > Environment: Ubuntu 14.04, 3.13.0-32 generic > Debian 8, gcc 4.9.2 >Reporter: Miguel Bernadin >Assignee: Joseph Wu > Labels: flaky, flaky-test > > I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and > software configurations. > After performing {{../configure}}, {{make}}, and {{make check}} some servers > have completed successfully and others failed on the test {{[ RUN ] > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics}}. > Is there something I should check in this test? > {code} > PERFORMED MAKE CHECK NODE-001 > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 > I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave > 20151005-143735-2393768202-35106-27900-S0 > Registered executor on svdidac038.techlabs.ac
[jira] [Created] (MESOS-4047) MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky
Joseph Wu created MESOS-4047: Summary: MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery is flaky Key: MESOS-4047 URL: https://issues.apache.org/jira/browse/MESOS-4047 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.26.0 Environment: Ubuntu 14, gcc 4.8.4 Reporter: Joseph Wu {code:title=Output from passed test} [--] 1 test from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.000430889 s, 2.4 GB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery I1202 11:09:14.319327 5062 exec.cpp:134] Version: 0.27.0 I1202 11:09:14.17 5079 exec.cpp:208] Executor registered on slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 Registered executor on ubuntu Starting task 4e62294c-cfcf-4a13-b699-c6a4b7ac5162 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' Forked command at 5085 I1202 11:09:14.391739 5077 exec.cpp:254] Received reconnect request from slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 I1202 11:09:14.398598 5082 exec.cpp:231] Executor re-registered on slave bea15b35-9aa1-4b57-96fb-29b5f70638ac-S0 Re-registered executor on ubuntu Shutting down Sending SIGTERM to process tree at pid 5085 Killing the following process trees: [ -+- 5085 sh -c while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done \--- 5086 dd count=512 bs=1M if=/dev/zero of=./temp ] [ OK ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery (1096 ms) {code} {code:title=Output from failed test} [--] 1 test from MemoryPressureMesosTest 1+0 records in 1+0 records out 1048576 bytes (1.0 MB) copied, 0.000404489 s, 2.6 GB/s [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery I1202 11:09:15.509950 5109 exec.cpp:134] Version: 0.27.0 I1202 11:09:15.568183 5123 exec.cpp:208] Executor registered on slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 Registered executor on ubuntu Starting task 14b6bab9-9f60-4130-bdc4-44efba262bc6 Forked command at 5132 sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' I1202 11:09:15.665498 5129 exec.cpp:254] Received reconnect request from slave 88734acc-718e-45b0-95b9-d8f07cea8a9e-S0 I1202 11:09:15.670995 5123 exec.cpp:381] Executor asked to shutdown Shutting down Sending SIGTERM to process tree at pid 5132 ../../src/tests/containerizer/memory_pressure_tests.cpp:283: Failure (usage).failure(): Unknown container: ebe90e15-72fa-4519-837b-62f43052c913 *** Aborted at 1449083355 (unix time) try "date -d @1449083355" if you are using GNU date *** {code} Notice that in the failed test, the executor is asked to shutdown when it tries to reconnect to the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3787) As a developer, I'd like to be able to expand environment variables through the Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036382#comment-15036382 ] Jojy Varghese commented on MESOS-3787: -- [~adam-mesos] That is a great point. I would think that we would need access checks inside/after the expansion of the env variables. > As a developer, I'd like to be able to expand environment variables through > the Docker executor. > > > Key: MESOS-3787 > URL: https://issues.apache.org/jira/browse/MESOS-3787 > Project: Mesos > Issue Type: Wish >Reporter: John Garcia > Labels: mesosphere > Attachments: mesos.patch, test-example.json > > > We'd like to have expanded variables usable in [the json files used to create > a Marathon app, hence] the Task's CommandInfo, so that the executor is able > to detect the correct values at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036287#comment-15036287 ] Neil Conway commented on MESOS-4045: Repros for me on OSX 10.10 > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-4045: --- Component/s: stout > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4046) Enable `Env` specified in docker image can be returned from docker pull
Gilbert Song created MESOS-4046: --- Summary: Enable `Env` specified in docker image can be returned from docker pull Key: MESOS-4046 URL: https://issues.apache.org/jira/browse/MESOS-4046 Project: Mesos Issue Type: Improvement Components: docker Reporter: Gilbert Song Assignee: Gilbert Song Currently docker pull only returns an image structure, which only contains entrypoint info. We run docker inspect as a subprocess inside docker pull, and its output provides a lot of other useful information about a docker image. We should be able to support returning environment variable information from the image. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
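For reference, the {{Env}} array is already part of the image config that {{docker inspect}} reports, so the data is there to be passed through. A minimal illustration (image name and output are examples, not taken from this ticket):
{code}
docker inspect --format '{{json .Config.Env}}' busybox:latest
# ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"]
{code}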
[jira] [Updated] (MESOS-3183) Documentation images do not load
[ https://issues.apache.org/jira/browse/MESOS-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-3183: - Description: Any images which are referenced from the generated docs ({{docs/*.md}}) do not show up on the website. For example: * [Architecture|http://mesos.apache.org/documentation/latest/architecture/] * [External Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/] * [Fetcher Cache Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/] * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] * [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/] was: Any images which are referenced from the generated docs ({{docs/*.md}}) do not show up on the website. For example: * [External Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/] * [Fetcher Cache Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/] * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] * [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/] > Documentation images do not load > > > Key: MESOS-3183 > URL: https://issues.apache.org/jira/browse/MESOS-3183 > Project: Mesos > Issue Type: Documentation > Components: documentation >Affects Versions: 0.24.0 >Reporter: James Mulcahy >Priority: Minor > Labels: mesosphere > Attachments: rake.patch > > > Any images which are referenced from the generated docs ({{docs/*.md}}) do > not show up on the website. For example: > * [Architecture|http://mesos.apache.org/documentation/latest/architecture/] > * [External > Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/] > * [Fetcher Cache > Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/] > * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] > * > [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3493) benchmark for declining offers
[ https://issues.apache.org/jira/browse/MESOS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036178#comment-15036178 ] James Peach commented on MESOS-3493: thanks! > benchmark for declining offers > -- > > Key: MESOS-3493 > URL: https://issues.apache.org/jira/browse/MESOS-3493 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Labels: Mesosphere > > I wrote a benchmark that can be used to demonstrate the performance issues > addressed in MESOS-3052, MESOS-3051, MESOS-3157 and MESOS-3075. The benchmark > simulates a number of frameworks that start declining all offers once they > reach the limit of work they need to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4045) NumifyTest.HexNumberTest fails
[ https://issues.apache.org/jira/browse/MESOS-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036157#comment-15036157 ] Michael Park commented on MESOS-4045: - [~wangcong] Looks like this one was introduced in https://github.com/apache/mesos/commit/7745eea2a4f4dff6e12f1955baa996a1869af3dc ? > NumifyTest.HexNumberTest fails > -- > > Key: MESOS-4045 > URL: https://issues.apache.org/jira/browse/MESOS-4045 > Project: Mesos > Issue Type: Bug >Reporter: Michael Park > > {noformat} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from NumifyTest > [ RUN ] NumifyTest.HexNumberTest > ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: > Failure > Value of: numify("0x10.9").isError() > Actual: false > Expected: true > [ FAILED ] NumifyTest.HexNumberTest (0 ms) > [--] 1 test from NumifyTest (0 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (0 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] NumifyTest.HexNumberTest > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4045) NumifyTest.HexNumberTest fails
Michael Park created MESOS-4045: --- Summary: NumifyTest.HexNumberTest fails Key: MESOS-4045 URL: https://issues.apache.org/jira/browse/MESOS-4045 Project: Mesos Issue Type: Bug Reporter: Michael Park {noformat} [==] Running 1 test from 1 test case. [--] Global test environment set-up. [--] 1 test from NumifyTest [ RUN ] NumifyTest.HexNumberTest ../../../../3rdparty/libprocess/3rdparty/stout/tests/numify_tests.cpp:44: Failure Value of: numify("0x10.9").isError() Actual: false Expected: true [ FAILED ] NumifyTest.HexNumberTest (0 ms) [--] 1 test from NumifyTest (0 ms total) [--] Global test environment tear-down [==] 1 test from 1 test case ran. (0 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] NumifyTest.HexNumberTest {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3493) benchmark for declining offers
[ https://issues.apache.org/jira/browse/MESOS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036132#comment-15036132 ] Joris Van Remoortere commented on MESOS-3493: - [~jamespeach]Let's add this to the sprint for early next week :-) > benchmark for declining offers > -- > > Key: MESOS-3493 > URL: https://issues.apache.org/jira/browse/MESOS-3493 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Labels: Mesosphere > > I wrote a benchmark that can be used to demonstrate the performance issues > addressed in MESOS-3052, MESOS-3051, MESOS-3157 and MESOS-3075. The benchmark > simulates a number of frameworks that start declining all offers once they > reach the limit of work they need to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3493) benchmark for declining offers
[ https://issues.apache.org/jira/browse/MESOS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3493: Labels: Mesosphere (was: ) > benchmark for declining offers > -- > > Key: MESOS-3493 > URL: https://issues.apache.org/jira/browse/MESOS-3493 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Labels: Mesosphere > > I wrote a benchmark that can be used to demonstrate the performance issues > addressed in MESOS-3052, MESOS-3051, MESOS-3157 and MESOS-3075. The benchmark > simulates a number of frameworks that start declining all offers once they > reach the limit of work they need to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4026) RegistryClientTest.SimpleRegistryPuller is flaky
[ https://issues.apache.org/jira/browse/MESOS-4026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036072#comment-15036072 ] Jojy Varghese commented on MESOS-4026: -- https://reviews.apache.org/r/40872/ https://reviews.apache.org/r/40873/ The above patches have been tested 600+ times without the test being failed. > RegistryClientTest.SimpleRegistryPuller is flaky > > > Key: MESOS-4026 > URL: https://issues.apache.org/jira/browse/MESOS-4026 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Jojy Varghese > Labels: containerizer, flaky-test, mesosphere > > From ASF CI: > https://builds.apache.org/job/Mesos/1289/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/console > {code} > [ RUN ] RegistryClientTest.SimpleRegistryPuller > I1127 02:51:40.235900 362 registry_client.cpp:511] Response status for url > 'https://localhost:57828/v2/library/busybox/manifests/latest': 401 > Unauthorized > I1127 02:51:40.249766 360 registry_client.cpp:511] Response status for url > 'https://localhost:57828/v2/library/busybox/manifests/latest': 200 OK > I1127 02:51:40.251137 361 registry_puller.cpp:195] Downloading layer > '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' for image > 'busybox:latest' > I1127 02:51:40.258514 354 registry_client.cpp:511] Response status for url > 'https://localhost:57828/v2/library/busybox/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4': > 307 Temporary Redirect > I1127 02:51:40.264171 367 libevent_ssl_socket.cpp:1023] Socket error: > Connection reset by peer > ../../src/tests/containerizer/provisioner_docker_tests.cpp:1210: Failure > (socket).failure(): Failed accept: connection error: Connection reset by peer > [ FAILED ] RegistryClientTest.SimpleRegistryPuller (349 ms) > {code} > Logs from a previous run that passed: > {code} > [ RUN ] RegistryClientTest.SimpleRegistryPuller > I1126 18:49:05.306396 349 registry_client.cpp:511] Response status for url > 'https://localhost:53492/v2/library/busybox/manifests/latest': 401 > Unauthorized > I1126 18:49:05.321362 347 registry_client.cpp:511] Response status for url > 'https://localhost:53492/v2/library/busybox/manifests/latest': 200 OK > I1126 18:49:05.322720 352 registry_puller.cpp:195] Downloading layer > '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' for image > 'busybox:latest' > I1126 18:49:05.331317 350 registry_client.cpp:511] Response status for url > 'https://localhost:53492/v2/library/busybox/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4': > 307 Temporary Redirect > I1126 18:49:05.370625 352 registry_client.cpp:511] Response status for url > 'https://127.0.0.1:53492/': 200 OK > I1126 18:49:05.372102 355 registry_puller.cpp:294] Untarring layer > '1ce2e90b0bc7224de3db1f0d646fe8e2c4dd37f1793928287f6074bc451a57ea' downloaded > from registry to directory 'output_dir' > [ OK ] RegistryClientTest.SimpleRegistryPuller (353 ms) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3219) Slave recovery issues with Docker containerizer
[ https://issues.apache.org/jira/browse/MESOS-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036024#comment-15036024 ] Yong Tang commented on MESOS-3219: -- If your systems can assume that the container should always be running, then a (partial) workaround is to have a shell script that constantly restarts the mesos slave in a loop within the container (a sketch follows below). In this way, the shell script serves as the foreground process, so the container will not die. If the mesos slave process itself dies, then at least the shell script will restart it and it will recover correctly. That is obviously not a complete solution, but it may help in certain situations. > Slave recovery issues with Docker containerizer > --- > > Key: MESOS-3219 > URL: https://issues.apache.org/jira/browse/MESOS-3219 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Anderson >Assignee: Timothy Chen >Priority: Minor > > I'm working on setting up a Mesos environment with the > Docker containerizer and can't seem to get the recovery feature > working. I'm running CoreOS, so the slave processes themselves are > containerized. I have no issues running jobs without the recovery > features enabled, but all jobs fail to boot when I add the following > flags: > MESOS_DOCKER_KILL_ORPHANS=false > MESOS_DOCKER_MESOS_IMAGE=myrepo/my-slave-container > Inspecting the Docker images and their log output reveals that the > container invocation appears to be flawed - see this gist, which shows the > arguments as retrieved via `docker inspect` as well as the failed container's > log output: > https://gist.github.com/banjiewen/a2dc1784a82ed87edd6b > The containerizer is attempting to invoke an unquoted command via > `/bin/sh -c`, which, predictably, fails to pass the complete command. > This results in the error message shown in the second file in the > linked gist. > This is reproducible manually; quoting the arguments to `/bin/sh -c` > results in success (at least, it correctly receives the supplied > arguments). > The slave container itself is not logging anything of interest. > It's possible that my instance is configured incorrectly as well; the > documentation here is a bit vague and there aren't many examples on the web. > I'm running Mesos 0.23.0 installed via http://repos.mesosphere.io/ in an > Ubuntu 14.04 container. CoreOS is at the latest stable (717.3.0) which gives > a Docker version at about 1.6.2. > I'm happy to provide more details if necessary. Cheers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
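A minimal sketch of the entrypoint loop described in the comment above (the binary path and flags are illustrative, not taken from this ticket):
{code}
#!/bin/sh
# Keep mesos-slave as the container's foreground process and restart it
# whenever it exits, so the container itself stays alive.
while true; do
  /usr/sbin/mesos-slave \
    --master=zk://zookeeper:2181/mesos \
    --work_dir=/var/lib/mesos \
    --docker_mesos_image=myrepo/my-slave-container
  echo "mesos-slave exited with status $?, restarting in 5s" >&2
  sleep 5
done
{code}
Note that this only keeps the container alive; whether the restarted slave actually recovers its tasks still depends on checkpointing and the recovery flags discussed in the ticket.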
[jira] [Updated] (MESOS-4044) SlaveRecoveryTest/0.Reboot is flaky
[ https://issues.apache.org/jira/browse/MESOS-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-4044: -- Labels: cgroups flaky-test mesosphere (was: mesosphere) > SlaveRecoveryTest/0.Reboot is flaky > --- > > Key: MESOS-4044 > URL: https://issues.apache.org/jira/browse/MESOS-4044 > Project: Mesos > Issue Type: Bug > Components: slave > Environment: Debian 8 on VirtualBox > {{configure --enable-debug --enable-ssl --enable-libevent}} >Reporter: Alexander Rojas > Labels: cgroups, flaky-test, mesosphere > > Running the test program as: > {code} > sudo src/mesos-tests --gtest_filter="SlaveRecoveryTest/0.Reboot" > --gtest_repeat=100 --verbose --gtest_break_on_failure > {code} > ends up every time at some point with the failure: > {noformat} > [ RUN ] SlaveRecoveryTest/0.Reboot > I1202 15:18:00.036594 26328 leveldb.cpp:176] Opened db in 12.924775ms > I1202 15:18:00.037643 26328 leveldb.cpp:183] Compacted db in 980477ns > I1202 15:18:00.037693 26328 leveldb.cpp:198] Created db iterator in 15079ns > I1202 15:18:00.037706 26328 leveldb.cpp:204] Seeked to beginning of db in > 1356ns > I1202 15:18:00.037716 26328 leveldb.cpp:273] Iterated through 0 keys in the > db in 313ns > I1202 15:18:00.037753 26328 replica.cpp:780] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1202 15:18:00.038360 26346 recover.cpp:449] Starting replica recovery > I1202 15:18:00.040987 26346 master.cpp:367] Master > baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 (debian-vm.localdomain) started on > 127.0.1.1:33625 > I1202 15:18:00.040998 26346 master.cpp:369] Flags at startup: --acls="" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/xt1N2F/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" > --quiet="false" --recovery_slave_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="25secs" --registry_strict="true" > --root_submissions="true" --slave_ping_timeout="15secs" > --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/xt1N2F/master" > --zk_session_timeout="10secs" > I1202 15:18:00.041157 26346 master.cpp:414] Master only allowing > authenticated frameworks to register > I1202 15:18:00.041163 26346 master.cpp:419] Master only allowing > authenticated slaves to register > I1202 15:18:00.041168 26346 credentials.hpp:37] Loading credentials for > authentication from '/tmp/xt1N2F/credentials' > I1202 15:18:00.041410 26346 master.cpp:458] Using default 'crammd5' > authenticator > I1202 15:18:00.041524 26346 master.cpp:495] Authorization enabled > I1202 15:18:00.042917 26343 recover.cpp:475] Replica is in EMPTY status > I1202 15:18:00.043557 26343 master.cpp:1606] The newly elected leader is > master@127.0.1.1:33625 with id baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 > I1202 15:18:00.043577 26343 master.cpp:1619] Elected as the leading master! 
> I1202 15:18:00.043589 26343 master.cpp:1379] Recovering from registrar > I1202 15:18:00.043766 26343 registrar.cpp:309] Recovering registrar > I1202 15:18:00.044668 26344 replica.cpp:676] Replica in EMPTY status received > a broadcasted recover request from (21064)@127.0.1.1:33625 > I1202 15:18:00.045027 26349 recover.cpp:195] Received a recover response from > a replica in EMPTY status > I1202 15:18:00.045497 26349 recover.cpp:566] Updating replica status to > STARTING > I1202 15:18:00.055539 26349 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 9.859161ms > I1202 15:18:00.055599 26349 replica.cpp:323] Persisted replica status to > STARTING > I1202 15:18:00.055958 26346 recover.cpp:475] Replica is in STARTING status > I1202 15:18:00.057106 26342 replica.cpp:676] Replica in STARTING status > received a broadcasted recover request from (21065)@127.0.1.1:33625 > I1202 15:18:00.057462 26343 recover.cpp:195] Received a recover response from > a replica in STARTING status > I1202 15:18:00.057886 26347 recover.cpp:566] Updating replica status to VOTING > I1202 15:18:00.058706 26345 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 634303ns > I1202 15:18:00.058724 26345 replica.cpp:323] Persisted replica status to > VOTING > I1202 15:18:00.058821 26345 recover.cpp:580] Successfully joined the Paxos > group > I1202 15:18:00.058980 26345 recover.cpp:464] Recover process terminated > I1202 15:18:00.059288 26348 log.cpp:661] Attempting to start the writer > I1202 15:18:00.0603
[jira] [Updated] (MESOS-4043) CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky
[ https://issues.apache.org/jira/browse/MESOS-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-4043: -- Labels: cgroups flaky-test (was: ) > CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky > - > > Key: MESOS-4043 > URL: https://issues.apache.org/jira/browse/MESOS-4043 > Project: Mesos > Issue Type: Bug > Components: isolation > Environment: Debian 8 (In virutal machine). > Build with: > {{configure --enable-ssl --enable-libevent --enable-debug}} >Reporter: Alexander Rojas >Assignee: Timothy Chen > Labels: cgroups, flaky-test > > Running the test with > {code} > sudo src/mesos-tests --gtest_repeat=100 --verbose --gtest_break_on_failur > {code} > yielded at least once: > {noformat} > [ RUN ] > CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess > I1202 14:59:40.530966 2564 cgroups.cpp:2429] Freezing cgroup > /sys/fs/cgroup/freezer/mesos_test > I1202 14:59:40.546022 2566 cgroups.cpp:1411] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos_test after 14.974976ms > I1202 14:59:40.560233 2566 cgroups.cpp:2447] Thawing cgroup > /sys/fs/cgroup/freezer/mesos_test > I1202 14:59:40.574983 2570 cgroups.cpp:1440] Successfullly thawed cgroup > /sys/fs/cgroup/freezer/mesos_test after 14.671104ms > ../../src/tests/containerizer/cgroups_tests.cpp:939: Failure > Value of: ::waitpid(pid, &status, 0) > Actual: 26319 > Expected: -1 > *** Aborted at 1449064780 (unix time) try "date -d @1449064780" if you are > using GNU date *** > PC: @ 0x14b07ae testing::UnitTest::AddTestPartResult() > *** SIGSEGV (@0x0) received by PID 2549 (TID 0x7f0017c287c0) from PID 0; > stack trace: *** > @ 0x7e9b866c os::Linux::chained_handler() > @ 0x7e9bca0a JVM_handle_linux_signal > @ 0x7f00115458d0 (unknown) > @ 0x14b07ae testing::UnitTest::AddTestPartResult() > @ 0x14a51e7 testing::internal::AssertHelper::operator=() > @ 0x14129f4 > mesos::internal::tests::CgroupsAnyHierarchyWithFreezerTest_ROOT_CGROUPS_DestroyTracedProcess_Test::TestBody() > @ 0x14ce2d0 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x14c9248 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x14aa587 testing::Test::Run() > @ 0x14aad15 testing::TestInfo::Run() > @ 0x14ab350 testing::TestCase::Run() > @ 0x14b1c9f testing::internal::UnitTestImpl::RunAllTests() > @ 0x14cef5f > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x14c9d9e > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x14b09cf testing::UnitTest::Run() > @ 0xd63e02 RUN_ALL_TESTS() > @ 0xd639e0 main > @ 0x7f00111aeb45 (unknown) > @ 0x9588f9 (unknown) > {noformat} > However running: > {code} > sudo src/mesos-tests > --gtest_filter="CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess" > --gtest_repeat=1000 --verbose --gtest_break_on_failure > {code} > Doesn't reproduce the error. It may be cause by a state left by a previous > test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4044) SlaveRecoveryTest/0.Reboot is flaky
Alexander Rojas created MESOS-4044: -- Summary: SlaveRecoveryTest/0.Reboot is flaky Key: MESOS-4044 URL: https://issues.apache.org/jira/browse/MESOS-4044 Project: Mesos Issue Type: Bug Components: slave Environment: Debian 8 on VirtualBox {{configure --enable-debug --enable-ssl --enable-libevent}} Reporter: Alexander Rojas Running the test program as: {code} sudo src/mesos-tests --gtest_filter="SlaveRecoveryTest/0.Reboot" --gtest_repeat=100 --verbose --gtest_break_on_failure {code} ends up every time at some point with the failure: {noformat} [ RUN ] SlaveRecoveryTest/0.Reboot I1202 15:18:00.036594 26328 leveldb.cpp:176] Opened db in 12.924775ms I1202 15:18:00.037643 26328 leveldb.cpp:183] Compacted db in 980477ns I1202 15:18:00.037693 26328 leveldb.cpp:198] Created db iterator in 15079ns I1202 15:18:00.037706 26328 leveldb.cpp:204] Seeked to beginning of db in 1356ns I1202 15:18:00.037716 26328 leveldb.cpp:273] Iterated through 0 keys in the db in 313ns I1202 15:18:00.037753 26328 replica.cpp:780] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I1202 15:18:00.038360 26346 recover.cpp:449] Starting replica recovery I1202 15:18:00.040987 26346 master.cpp:367] Master baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 (debian-vm.localdomain) started on 127.0.1.1:33625 I1202 15:18:00.040998 26346 master.cpp:369] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/xt1N2F/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/xt1N2F/master" --zk_session_timeout="10secs" I1202 15:18:00.041157 26346 master.cpp:414] Master only allowing authenticated frameworks to register I1202 15:18:00.041163 26346 master.cpp:419] Master only allowing authenticated slaves to register I1202 15:18:00.041168 26346 credentials.hpp:37] Loading credentials for authentication from '/tmp/xt1N2F/credentials' I1202 15:18:00.041410 26346 master.cpp:458] Using default 'crammd5' authenticator I1202 15:18:00.041524 26346 master.cpp:495] Authorization enabled I1202 15:18:00.042917 26343 recover.cpp:475] Replica is in EMPTY status I1202 15:18:00.043557 26343 master.cpp:1606] The newly elected leader is master@127.0.1.1:33625 with id baeb70c6-c960-4d0d-9dc7-48b9a54ef8a8 I1202 15:18:00.043577 26343 master.cpp:1619] Elected as the leading master! 
I1202 15:18:00.043589 26343 master.cpp:1379] Recovering from registrar I1202 15:18:00.043766 26343 registrar.cpp:309] Recovering registrar I1202 15:18:00.044668 26344 replica.cpp:676] Replica in EMPTY status received a broadcasted recover request from (21064)@127.0.1.1:33625 I1202 15:18:00.045027 26349 recover.cpp:195] Received a recover response from a replica in EMPTY status I1202 15:18:00.045497 26349 recover.cpp:566] Updating replica status to STARTING I1202 15:18:00.055539 26349 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 9.859161ms I1202 15:18:00.055599 26349 replica.cpp:323] Persisted replica status to STARTING I1202 15:18:00.055958 26346 recover.cpp:475] Replica is in STARTING status I1202 15:18:00.057106 26342 replica.cpp:676] Replica in STARTING status received a broadcasted recover request from (21065)@127.0.1.1:33625 I1202 15:18:00.057462 26343 recover.cpp:195] Received a recover response from a replica in STARTING status I1202 15:18:00.057886 26347 recover.cpp:566] Updating replica status to VOTING I1202 15:18:00.058706 26345 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 634303ns I1202 15:18:00.058724 26345 replica.cpp:323] Persisted replica status to VOTING I1202 15:18:00.058821 26345 recover.cpp:580] Successfully joined the Paxos group I1202 15:18:00.058980 26345 recover.cpp:464] Recover process terminated I1202 15:18:00.059288 26348 log.cpp:661] Attempting to start the writer I1202 15:18:00.060330 26342 replica.cpp:496] Replica received implicit promise request from (21066)@127.0.1.1:33625 with proposal 1 I1202 15:18:00.061751 26342 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 1.395961ms I1202 15:18:00.061774 26342 replica.cpp:345] Persisted promised to 1 I1202 15:18:00.062237 26342 coordinator.cpp:240] Coordinator attempting to fill missing positions I1202 15:18:00.063148 26342 repli
[jira] [Created] (MESOS-4043) CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky
Alexander Rojas created MESOS-4043: -- Summary: CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess is Flaky Key: MESOS-4043 URL: https://issues.apache.org/jira/browse/MESOS-4043 Project: Mesos Issue Type: Bug Components: isolation Environment: Debian 8 (In virutal machine). Build with: {{configure --enable-ssl --enable-libevent --enable-debug}} Reporter: Alexander Rojas Assignee: Timothy Chen Running the test with {code} sudo src/mesos-tests --gtest_repeat=100 --verbose --gtest_break_on_failur {code} yielded at least once: {noformat} [ RUN ] CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess I1202 14:59:40.530966 2564 cgroups.cpp:2429] Freezing cgroup /sys/fs/cgroup/freezer/mesos_test I1202 14:59:40.546022 2566 cgroups.cpp:1411] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos_test after 14.974976ms I1202 14:59:40.560233 2566 cgroups.cpp:2447] Thawing cgroup /sys/fs/cgroup/freezer/mesos_test I1202 14:59:40.574983 2570 cgroups.cpp:1440] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos_test after 14.671104ms ../../src/tests/containerizer/cgroups_tests.cpp:939: Failure Value of: ::waitpid(pid, &status, 0) Actual: 26319 Expected: -1 *** Aborted at 1449064780 (unix time) try "date -d @1449064780" if you are using GNU date *** PC: @ 0x14b07ae testing::UnitTest::AddTestPartResult() *** SIGSEGV (@0x0) received by PID 2549 (TID 0x7f0017c287c0) from PID 0; stack trace: *** @ 0x7e9b866c os::Linux::chained_handler() @ 0x7e9bca0a JVM_handle_linux_signal @ 0x7f00115458d0 (unknown) @ 0x14b07ae testing::UnitTest::AddTestPartResult() @ 0x14a51e7 testing::internal::AssertHelper::operator=() @ 0x14129f4 mesos::internal::tests::CgroupsAnyHierarchyWithFreezerTest_ROOT_CGROUPS_DestroyTracedProcess_Test::TestBody() @ 0x14ce2d0 testing::internal::HandleSehExceptionsInMethodIfSupported<>() @ 0x14c9248 testing::internal::HandleExceptionsInMethodIfSupported<>() @ 0x14aa587 testing::Test::Run() @ 0x14aad15 testing::TestInfo::Run() @ 0x14ab350 testing::TestCase::Run() @ 0x14b1c9f testing::internal::UnitTestImpl::RunAllTests() @ 0x14cef5f testing::internal::HandleSehExceptionsInMethodIfSupported<>() @ 0x14c9d9e testing::internal::HandleExceptionsInMethodIfSupported<>() @ 0x14b09cf testing::UnitTest::Run() @ 0xd63e02 RUN_ALL_TESTS() @ 0xd639e0 main @ 0x7f00111aeb45 (unknown) @ 0x9588f9 (unknown) {noformat} However running: {code} sudo src/mesos-tests --gtest_filter="CgroupsAnyHierarchyWithFreezerTest.ROOT_CGROUPS_DestroyTracedProcess" --gtest_repeat=1000 --verbose --gtest_break_on_failure {code} Doesn't reproduce the error. It may be cause by a state left by a previous test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4041) Default command executor do not have executor_id
[ https://issues.apache.org/jira/browse/MESOS-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035836#comment-15035836 ] Klaus Ma commented on MESOS-4041: - [~gyliu], for command line, Master does not know executor information; the executor info is build in Mesos Slave. I'm working on MESOS-1718 to re-build command executor info in Mesos Master; so 1. no resources overcommited, I'm going to cut some resources of command tasks to executor 2. Master has the executor info for Optimistic Offer Phase 1 > Default command executor do not have executor_id > > > Key: MESOS-4041 > URL: https://issues.apache.org/jira/browse/MESOS-4041 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > I was doing some test with Marathon on top of mesos and found that when using > mesos command executor, the executor_id is always empty. > {code} > "state": "TASK_RUNNING", > "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", > "resources": { > "ports": "[33505-33505]", > "mem": 16, > "disk": 0, > "cpus": 0.1 > }, > "name": "t1", > "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", > "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", > "executor_id": "" > } > {code} > When I test with mesos-executor with command line executor, also no executor > id. > {code} > { > "unregistered_frameworks": [], > "frameworks": [ > { > "webui_url": "", > "user": "root", > "used_resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "unregistered_time": 0, > "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "hostname": "devstack007.cn.ibm.com", > "failover_timeout": 0, > "executors": [], > "completed_tasks": [], > "checkpoint": false, > "capabilities": [], > "active": true, > "name": "", > "offered_resources": { > "mem": 0, > "disk": 0, > "cpus": 0 > }, > "offers": [], > "pid": > "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", > "registered_time": 1445236263.95058, > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "role": "*", > "tasks": [ > { > "statuses": [ > { > "timestamp": 1445236266.63443, > "state": "TASK_RUNNING", > "container_status": { > "network_infos": [ > { > "ip_address": "9.111.242.187" > } > ] > } > } > ], > "state": "TASK_RUNNING", > "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "name": "cluster-test", > "id": "cluster-test", > "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "executor_id": "" > } > ] > } > ], > "completed_frameworks": [] > } > {code} > This caused the end use can not use http end point to kill some executors as > mesos require executor id when kill executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4042) Complete LevelDBStateTest suite fails in optimized build
[ https://issues.apache.org/jira/browse/MESOS-4042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4042: Description: Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared folder fails with {code} [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch ../../src/tests/state_tests.cpp:90: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch ../../src/tests/state_tests.cpp:120: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch ../../src/tests/state_tests.cpp:156: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch ../../src/tests/state_tests.cpp:198: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge ../../src/tests/state_tests.cpp:233: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch ../../src/tests/state_tests.cpp:264: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms) [ RUN ] LevelDBStateTest.Names ../../src/tests/state_tests.cpp:304: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.Names (10 ms) {code} The identical error occurs for a non-optimized build. 
was: Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared folder fails with {code} [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch ../../src/tests/state_tests.cpp:90: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch ../../src/tests/state_tests.cpp:120: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch ../../src/tests/state_tests.cpp:156: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch ../../src/tests/state_tests.cpp:198: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge ../../src/tests/state_tests.cpp:233: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch ../../src/tests/state_tests.cpp:264: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms) [ RUN ] LevelDBStateTest.Names ../../src/tests/state_tests.cpp:304: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.Names (10 ms) {code} At least for me not optimized builds seem unaffected. > Complete LevelDBStateTest suite fails in optimized build > > > Key: MESOS-4042 > URL: https://issues.apache.org/jira/browse/MESOS-4042 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Bannier > > Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a > ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared > folder fails with > {code} > [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch > ../../src/tests/state_tests.cpp:90: Failure > (fut
[jira] [Created] (MESOS-4042) Complete LevelDBStateTest suite fails in optimized build
Benjamin Bannier created MESOS-4042: --- Summary: Complete LevelDBStateTest suite fails in optimized build Key: MESOS-4042 URL: https://issues.apache.org/jira/browse/MESOS-4042 Project: Mesos Issue Type: Bug Reporter: Benjamin Bannier Building and checking {{5c0e4dc974014b0afd1f2752ff60a61c651de478}} in a ubuntu14.04 virtualbox with {{--enable-optimized}} in a virtualbox shared folder fails with {code} [ RUN ] LevelDBStateTest.FetchAndStoreAndFetch ../../src/tests/state_tests.cpp:90: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndFetch (15 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch ../../src/tests/state_tests.cpp:120: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreAndFetch (13 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch ../../src/tests/state_tests.cpp:156: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndStoreFailAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch ../../src/tests/state_tests.cpp:198: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndFetch (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge ../../src/tests/state_tests.cpp:233: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndExpunge (10 ms) [ RUN ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch ../../src/tests/state_tests.cpp:264: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.FetchAndStoreAndExpungeAndStoreAndFetch (12 ms) [ RUN ] LevelDBStateTest.Names ../../src/tests/state_tests.cpp:304: Failure (future1).failure(): IO error: /vagrant/mesos/build/.state/MANIFEST-01: Invalid argument [ FAILED ] LevelDBStateTest.Names (10 ms) {code} At least for me not optimized builds seem unaffected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
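If the vboxsf shared folder is the culprit here (LevelDB is known to be sensitive to filesystems that do not fully support its locking/mmap usage; that is an assumption, not something confirmed in this ticket), an out-of-source build on a native filesystem inside the VM should sidestep it, e.g.:
{code}
# Out-of-source build on the VM's native filesystem instead of /vagrant
# (paths are illustrative).
mkdir -p "$HOME/mesos-build" && cd "$HOME/mesos-build"
/vagrant/mesos/configure --enable-optimized
make check
{code}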
[jira] [Commented] (MESOS-3215) CgroupsAnyHierarchyWithPerfEventTest failing on Ubuntu 14.04
[ https://issues.apache.org/jira/browse/MESOS-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035689#comment-15035689 ] Bernd Mathiske commented on MESOS-3215: --- Do you have perf enabled on that machine? > CgroupsAnyHierarchyWithPerfEventTest failing on Ubuntu 14.04 > > > Key: MESOS-3215 > URL: https://issues.apache.org/jira/browse/MESOS-3215 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan > Labels: mesosphere > > [ RUN ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf > ../../src/tests/containerizer/cgroups_tests.cpp:172: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy > ../../src/tests/containerizer/cgroups_tests.cpp:190: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy > [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf (9 ms) > [--] 1 test from CgroupsAnyHierarchyWithPerfEventTest (9 ms total) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
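A quick way to verify perf support on the test machine (commands are illustrative; the package providing {{perf}} varies by distro and kernel version):
{code}
perf --version
perf stat -e cycles -- sleep 1   # prints a counter summary if perf works
{code}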
[jira] [Comment Edited] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035684#comment-15035684 ] Bernd Mathiske edited comment on MESOS-4035 at 12/2/15 11:48 AM: - It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. And also auto-disabling the test (ideally prompting a message about why then): https://issues.apache.org/jira/browse/MESOS-3471 was (Author: bernd-mesos): It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a v
[jira] [Commented] (MESOS-4039) PerfEventIsolatorTest.ROOT_CGROUPS_Sample fails
[ https://issues.apache.org/jira/browse/MESOS-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035687#comment-15035687 ] Bernd Mathiske commented on MESOS-4039: --- It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. And also auto-disabling the test (ideally prompting a message about why then): https://issues.apache.org/jira/browse/MESOS-3471 > PerfEventIsolatorTest.ROOT_CGROUPS_Sample fails > --- > > Key: MESOS-4039 > URL: https://issues.apache.org/jira/browse/MESOS-4039 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann > Labels: mesosphere, test-fail > > PerfEventIsolatorTest.ROOT_CGROUPS_Sample fails on CentOS 6.6: > {code} > [--] 1 test from PerfEventIsolatorTest > [ RUN ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample > ../../src/tests/containerizer/isolator_tests.cpp:848: Failure > isolator: Perf is not supported > [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample (79 ms) > [--] 1 test from PerfEventIsolatorTest (79 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (86 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4038) SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035685#comment-15035685 ] Bernd Mathiske edited comment on MESOS-4038 at 12/2/15 11:48 AM: - It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. And also auto-disabling the test (ideally prompting a message about why then): https://issues.apache.org/jira/browse/MESOS-3471 was (Author: bernd-mesos): It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. > SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6 > -- > > Key: MESOS-4038 > URL: https://issues.apache.org/jira/browse/MESOS-4038 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > All {{SlaveRecoveryTest.\*}} tests, > {{MesosContainerizerSlaveRecoveryTest.\*}} tests, and > {{UserCgroupIsolatorTest*}} tests fail on CentOS 6.6 with {{TypeParam = > mesos::internal::slave::MesosContainerizer}}. They all fail with the same > error: > {code} > [--] 1 test from SlaveRecoveryTest/0, where TypeParam = > mesos::internal::slave::MesosContainerizer > [ RUN ] SlaveRecoveryTest/0.ReconnectExecutor > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/cgroup/perf_event' already exists in > the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-SlaveRecoveryTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/cgroup/perf_event' is not a valid hierarchy > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer (8 ms) > [--] 1 test from SlaveRecoveryTest/0 (9 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (15 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4038) SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035685#comment-15035685 ] Bernd Mathiske commented on MESOS-4038: --- It seems that the right "fix" is adding documentation about needing to have perf support, respectively what to expect on which kinds of VMs on how to set them up. > SlaveRecoveryTests, UserCgroupIsolatorTests fail on CentOS 6.6 > -- > > Key: MESOS-4038 > URL: https://issues.apache.org/jira/browse/MESOS-4038 > Project: Mesos > Issue Type: Bug > Environment: CentOS 6.6 >Reporter: Greg Mann > Labels: mesosphere, test-failure > > All {{SlaveRecoveryTest.\*}} tests, > {{MesosContainerizerSlaveRecoveryTest.\*}} tests, and > {{UserCgroupIsolatorTest*}} tests fail on CentOS 6.6 with {{TypeParam = > mesos::internal::slave::MesosContainerizer}}. They all fail with the same > error: > {code} > [--] 1 test from SlaveRecoveryTest/0, where TypeParam = > mesos::internal::slave::MesosContainerizer > [ RUN ] SlaveRecoveryTest/0.ReconnectExecutor > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/cgroup/perf_event' already exists in > the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-SlaveRecoveryTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/cgroup/perf_event' is not a valid hierarchy > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer (8 ms) > [--] 1 test from SlaveRecoveryTest/0 (9 ms total) > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (15 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] SlaveRecoveryTest/0.ReconnectExecutor, where TypeParam = > mesos::internal::slave::MesosContainerizer > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
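Concretely, the two options suggested by the error text above might look like this (the path is taken from the log; unmounting assumes nothing else is using that hierarchy):
{code}
# Remove the pre-existing hierarchy so the tests can create their own:
sudo umount /cgroup/perf_event
sudo rmdir /cgroup/perf_event

# ...or skip the affected test cases:
sudo ./bin/mesos-tests.sh --gtest_filter="-SlaveRecoveryTest/0.*"
{code}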
[jira] [Updated] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-4035: -- Target Version/s: 0.27.0 > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess (2 ms) > [--] 1 test from UserCgroupIsolatorTest/2 (2 ms total) > [--] Global test environment tear-down > [==] 3 tests from 3 test cases ran. (349 ms total) > [ PASSED ] 0 tests. > [ FAILED ] 3 tests, listed below: > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess > [ FAILED ] UserCgroupIsolat
[jira] [Commented] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035684#comment-15035684 ] Bernd Mathiske commented on MESOS-4035: --- It seems that the right "fix" is to add documentation about needing perf support, about what to expect on different kinds of VMs, and about how to set them up. > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsPerfEventIsolatorProcess (2 ms) > [--] 1 test from UserCgroupIsolatorTest/2 (2 ms total) > [--] Global test environment tear-down > [==] 3 tests from 3 test cases ran. (349 ms total) > [ PASSED ] 0 tests. > [ FAILED
[jira] [Commented] (MESOS-3586) MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and CGROUPS_ROOT_SlaveRecovery are flaky
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035656#comment-15035656 ] Jan Schlicht commented on MESOS-3586: - Thanks! Can confirm that this fixes the flakiness with CentOS 7.1 for me. I was running {{sudo ./bin/mesos-tests.sh --gtest_filter="MemoryPressureMesosTest.CGROUPS_ROOT_Statistics" --gtest_repeat=100 --gtest_break_on_failure}} > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics and > CGROUPS_ROOT_SlaveRecovery are flaky > > > Key: MESOS-3586 > URL: https://issues.apache.org/jira/browse/MESOS-3586 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.24.0, 0.26.0 > Environment: Ubuntu 14.04, 3.13.0-32 generic > Debian 8, gcc 4.9.2 >Reporter: Miguel Bernadin >Assignee: Joseph Wu > Labels: flaky, flaky-test > > I am install Mesos 0.24.0 on 4 servers which have very similar hardware and > software configurations. > After performing ../configure, make, and make check some servers have > completed successfully and other failed on test [ RUN ] > MemoryPressureMesosTest.CGROUPS_ROOT_Statistics. > Is there something I should check in this test? > PERFORMED MAKE CHECK NODE-001 > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0 > I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave > 20151005-143735-2393768202-35106-27900-S0 > Registered executor on svdidac038.techlabs.accenture.com > Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0 > Forked command at 38510 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > PERFORMED MAKE CHECK NODE-002 > [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics > I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0 > I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave > 20151005-143857-2360213770-50427-26325-S0 > Registered executor on svdidac039.techlabs.accenture.com > Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28 > sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done' > Forked command at 37028 > ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure > Expected: (usage.get().mem_medium_pressure_counter()) >= > (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6 > 2015-10-05 > 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: > Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server > refused to accept the client > [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4041) Default command executor do not have executor_id
[ https://issues.apache.org/jira/browse/MESOS-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035618#comment-15035618 ] Guangya Liu commented on MESOS-4041: The reason is that the framework does not set the executor id; this is true for both mesos-executor and marathon: https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/mesos/TaskBuilder.scala#L162 This will prevent optimistic offer phase 1 from working when there is no executor id, as the mesos master cannot determine which executor to evict. If we want marathon to also work with optimistic offer phase 1, then we may need to update marathon to set the executor info even with the command executor. > Default command executor do not have executor_id > > > Key: MESOS-4041 > URL: https://issues.apache.org/jira/browse/MESOS-4041 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > I was doing some test with Marathon on top of mesos and found that when using > mesos command executor, the executor_id is always empty. > {code} > "state": "TASK_RUNNING", > "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", > "resources": { > "ports": "[33505-33505]", > "mem": 16, > "disk": 0, > "cpus": 0.1 > }, > "name": "t1", > "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", > "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", > "executor_id": "" > } > {code} > When I test with mesos-executor with command line executor, also no executor > id. > {code} > { > "unregistered_frameworks": [], > "frameworks": [ > { > "webui_url": "", > "user": "root", > "used_resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "unregistered_time": 0, > "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "hostname": "devstack007.cn.ibm.com", > "failover_timeout": 0, > "executors": [], > "completed_tasks": [], > "checkpoint": false, > "capabilities": [], > "active": true, > "name": "", > "offered_resources": { > "mem": 0, > "disk": 0, > "cpus": 0 > }, > "offers": [], > "pid": > "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", > "registered_time": 1445236263.95058, > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "role": "*", > "tasks": [ > { > "statuses": [ > { > "timestamp": 1445236266.63443, > "state": "TASK_RUNNING", > "container_status": { > "network_infos": [ > { > "ip_address": "9.111.242.187" > } > ] > } > } > ], > "state": "TASK_RUNNING", > "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "name": "cluster-test", > "id": "cluster-test", > "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "executor_id": "" > } > ] > } > ], > "completed_frameworks": [] > } > {code} > This caused the end use can not use http end point to kill some executors as > mesos require executor id when kill executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4041) Default command executor do not have executor_id
[ https://issues.apache.org/jira/browse/MESOS-4041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035620#comment-15035620 ] Guangya Liu commented on MESOS-4041: [~jvanremoortere] [~kaysoky] any comments? Thanks. > Default command executor do not have executor_id > > > Key: MESOS-4041 > URL: https://issues.apache.org/jira/browse/MESOS-4041 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu > > I was doing some test with Marathon on top of mesos and found that when using > mesos command executor, the executor_id is always empty. > {code} > "state": "TASK_RUNNING", > "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", > "resources": { > "ports": "[33505-33505]", > "mem": 16, > "disk": 0, > "cpus": 0.1 > }, > "name": "t1", > "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", > "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", > "executor_id": "" > } > {code} > When I test with mesos-executor with command line executor, also no executor > id. > {code} > { > "unregistered_frameworks": [], > "frameworks": [ > { > "webui_url": "", > "user": "root", > "used_resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "unregistered_time": 0, > "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "hostname": "devstack007.cn.ibm.com", > "failover_timeout": 0, > "executors": [], > "completed_tasks": [], > "checkpoint": false, > "capabilities": [], > "active": true, > "name": "", > "offered_resources": { > "mem": 0, > "disk": 0, > "cpus": 0 > }, > "offers": [], > "pid": > "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", > "registered_time": 1445236263.95058, > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "role": "*", > "tasks": [ > { > "statuses": [ > { > "timestamp": 1445236266.63443, > "state": "TASK_RUNNING", > "container_status": { > "network_infos": [ > { > "ip_address": "9.111.242.187" > } > ] > } > } > ], > "state": "TASK_RUNNING", > "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", > "resources": { > "mem": 256, > "disk": 0, > "cpus": 1 > }, > "name": "cluster-test", > "id": "cluster-test", > "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", > "executor_id": "" > } > ] > } > ], > "completed_frameworks": [] > } > {code} > This caused the end use can not use http end point to kill some executors as > mesos require executor id when kill executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3861) Authenticate quota requests
[ https://issues.apache.org/jira/browse/MESOS-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3861: --- Description: Quota requests need to be authenticated. This ticket will authenticate quota requests using credentials provided by the {{Authorization}} field of the HTTP request. This is similar to how authentication is implemented in {{Master::Http}}. was: Quota requests need to be authenticated. This ticket will authenticate quota requests using credentials provided by the `Authorization` field of the HTTP request. This is similar to how authentication is implemented in `Master::Http`. > Authenticate quota requests > --- > > Key: MESOS-3861 > URL: https://issues.apache.org/jira/browse/MESOS-3861 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, security > > Quota requests need to be authenticated. > This ticket will authenticate quota requests using credentials provided by > the {{Authorization}} field of the HTTP request. This is similar to how > authentication is implemented in {{Master::Http}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
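For illustration, once this lands a quota request would carry credentials in the HTTP {{Authorization}} header, the same way other authenticated master endpoints do. A minimal sketch with curl, where the {{/quota}} path, the principal/secret pair, and the request body are assumptions for illustration rather than the final API:
{noformat}
# '-u principal:secret' makes curl add a basic 'Authorization' header.
# The endpoint path and the JSON body below are hypothetical.
curl -i -u principal:secret \
     -X POST \
     -d '{"role": "dev", "guarantee": [{"name": "cpus", "type": "SCALAR", "scalar": {"value": 2}}]}' \
     http://master.example.com:5050/quota
{noformat}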
[jira] [Updated] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-4025: Sprint: (was: Mesosphere Sprint 23) Story Points: (was: 3) > SlaveRecoveryTest/0.GCExecutor is flaky. > > > Key: MESOS-4025 > URL: https://issues.apache.org/jira/browse/MESOS-4025 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 >Reporter: Till Toenshoff > Labels: flaky, flaky-test, test > > Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based > on 0.26.0-rc1. > Testsuite was run as root. > {noformat} > sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1 > {noformat} > {noformat} > [ RUN ] SlaveRecoveryTest/0.GCExecutor > I1130 16:49:16.336833 1032 exec.cpp:136] Version: 0.26.0 > I1130 16:49:16.345212 1049 exec.cpp:210] Executor registered on slave > dde9fd4e-b016-4a99-9081-b047e9df9afa-S0 > Registered executor on ubuntu14 > Starting task 22c63bba-cbf8-46fd-b23a-5409d69e4114 > sh -c 'sleep 1000' > Forked command at 1057 > ../../src/tests/mesos.cpp:779: Failure > (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup > '/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave': > Device or resource busy > *** Aborted at 1448902156 (unix time) try "date -d @1448902156" if you are > using GNU date *** > PC: @ 0x1443e9a testing::UnitTest::AddTestPartResult() > *** SIGSEGV (@0x0) received by PID 27364 (TID 0x7f1bfdd2b800) from PID 0; > stack trace: *** > @ 0x7f1be92b80b7 os::Linux::chained_handler() > @ 0x7f1be92bc219 JVM_handle_linux_signal > @ 0x7f1bf7bbc340 (unknown) > @ 0x1443e9a testing::UnitTest::AddTestPartResult() > @ 0x1438b99 testing::internal::AssertHelper::operator=() > @ 0xf0b3bb > mesos::internal::tests::ContainerizerTest<>::TearDown() > @ 0x1461882 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x145c6f8 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x143de4a testing::Test::Run() > @ 0x143e584 testing::TestInfo::Run() > @ 0x143ebca testing::TestCase::Run() > @ 0x1445312 testing::internal::UnitTestImpl::RunAllTests() > @ 0x14624a7 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x145d26e > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x14440ae testing::UnitTest::Run() > @ 0xd15cd4 RUN_ALL_TESTS() > @ 0xd158c1 main > @ 0x7f1bf7808ec5 (unknown) > @ 0x913009 (unknown) > {noformat} > My Vagrantfile generator; > {noformat} > #!/usr/bin/env bash > cat << EOF > Vagrantfile > # -*- mode: ruby -*-" > > # vi: set ft=ruby : > Vagrant.configure(2) do |config| > # Disable shared folder to prevent certain kernel module dependencies. 
> config.vm.synced_folder ".", "/vagrant", disabled: true > config.vm.box = "bento/ubuntu-14.04" > config.vm.hostname = "${PLATFORM_NAME}" > config.vm.provider "virtualbox" do |vb| > vb.memory = ${VAGRANT_MEM} > vb.cpus = ${VAGRANT_CPUS} > vb.customize ["modifyvm", :id, "--nictype1", "virtio"] > vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"] > vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"] > end > config.vm.provider "vmware_fusion" do |vb| > vb.memory = ${VAGRANT_MEM} > vb.cpus = ${VAGRANT_CPUS} > end > config.vm.provision "file", source: "../test.sh", destination: "~/test.sh" > config.vm.provision "shell", inline: <<-SHELL > sudo apt-get update > sudo apt-get -y install openjdk-7-jdk autoconf libtool > sudo apt-get -y install build-essential python-dev python-boto \ > libcurl4-nss-dev libsasl2-dev maven \ > libapr1-dev libsvn-dev libssl-dev libevent-dev > sudo apt-get -y install git > sudo wget -qO- https://get.docker.com/ | sh > SHELL > end > EOF > {noformat} > The problem is kicking in frequently in my tests - I'ld say > 10% but less > than 50%. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
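A debugging note on the teardown failure above: "Device or resource busy" when destroying a cgroup usually means processes are still attached to it. A small sketch for inspecting and cleaning up by hand on the affected machine, using the cgroup path from the log (only safe if the remaining processes really are leftover test children):
{noformat}
CGROUP=/sys/fs/cgroup/memory/mesos_test_e5edb2a8-9af3-441f-b991-613082f264e2/slave

# List the processes still attached to the cgroup that could not be removed.
cat "$CGROUP/cgroup.procs"

# Kill the leftovers; after that the cgroup directory can be removed.
for pid in $(cat "$CGROUP/cgroup.procs"); do sudo kill -9 "$pid"; done
sudo rmdir "$CGROUP"
{noformat}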
[jira] [Created] (MESOS-4041) Default command executor do not have executor_id
Guangya Liu created MESOS-4041: -- Summary: Default command executor do not have executor_id Key: MESOS-4041 URL: https://issues.apache.org/jira/browse/MESOS-4041 Project: Mesos Issue Type: Bug Reporter: Guangya Liu I was doing some tests with Marathon on top of mesos and found that when using the mesos command executor, the executor_id is always empty. {code} "state": "TASK_RUNNING", "slave_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-S0", "resources": { "ports": "[33505-33505]", "mem": 16, "disk": 0, "cpus": 0.1 }, "name": "t1", "id": "t1.ac8c4679-98cd-11e5-8b71-823a8274cef3", "framework_id": "62c4d3e2-7c80-4d80-a0cd-57b2eced1d81-0001", "executor_id": "" } {code} When I tested with mesos-executor using the command line executor, there was also no executor id. {code} { "unregistered_frameworks": [], "frameworks": [ { "webui_url": "", "user": "root", "used_resources": { "mem": 256, "disk": 0, "cpus": 1 }, "unregistered_time": 0, "id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", "hostname": "devstack007.cn.ibm.com", "failover_timeout": 0, "executors": [], "completed_tasks": [], "checkpoint": false, "capabilities": [], "active": true, "name": "", "offered_resources": { "mem": 0, "disk": 0, "cpus": 0 }, "offers": [], "pid": "scheduler-43ee3a42-9302-4815-8a07-500e42367f41@9.111.242.187:60454", "registered_time": 1445236263.95058, "resources": { "mem": 256, "disk": 0, "cpus": 1 }, "role": "*", "tasks": [ { "statuses": [ { "timestamp": 1445236266.63443, "state": "TASK_RUNNING", "container_status": { "network_infos": [ { "ip_address": "9.111.242.187" } ] } } ], "state": "TASK_RUNNING", "slave_id": "3e0df733-08b3-4883-b3fa-92bdc0c05b2f-S0", "resources": { "mem": 256, "disk": 0, "cpus": 1 }, "name": "cluster-test", "id": "cluster-test", "framework_id": "820e082f-be7c-4b59-abc5-9c02f9e8d66d-", "executor_id": "" } ] } ], "completed_frameworks": [] } {code} This means the end user cannot use the HTTP endpoint to kill some executors, as mesos requires an executor id when killing executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4040) Clang's address sanitizer reports heap-use-after-free in ExamplesTest.EventCallFramework
[ https://issues.apache.org/jira/browse/MESOS-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-4040: Attachment: asan.log > Clang's address sanitizer reports heap-use-after-free in > ExamplesTest.EventCallFramework > > > Key: MESOS-4040 > URL: https://issues.apache.org/jira/browse/MESOS-4040 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.26.0 >Reporter: Benjamin Bannier > Attachments: asan.log > > > For a libevent- and ssl-enabled debug build under ubuntu14.04 with clang-3.6 > and {{CXXFLAGS=-fsanitize=address}}, the address sanitizer reports a > use-after-free from {{ExamplesTest.EventCallFramework}} (log attached). > If this is not a false positive, it could lead to all kinds of issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
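For anyone trying to reproduce this, a build configuration along the lines described in the ticket (clang-3.6, libevent, SSL, ASan) would look roughly like the following; the exact flag spelling is an assumption based on the description, not a recorded command:
{noformat}
# Configure a libevent/SSL-enabled build instrumented with AddressSanitizer.
../configure CC=clang-3.6 CXX=clang++-3.6 \
             CFLAGS="-g -fsanitize=address" CXXFLAGS="-g -fsanitize=address" \
             --enable-libevent --enable-ssl
make -j4

# Run only the affected test; the ASan report is printed when the
# use-after-free is triggered.
./bin/mesos-tests.sh --gtest_filter="ExamplesTest.EventCallFramework"
{noformat}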
[jira] [Created] (MESOS-4040) Clang's address sanitizer reports heap-use-after-free in ExamplesTest.EventCallFramework
Benjamin Bannier created MESOS-4040: --- Summary: Clang's address sanitizer reports heap-use-after-free in ExamplesTest.EventCallFramework Key: MESOS-4040 URL: https://issues.apache.org/jira/browse/MESOS-4040 Project: Mesos Issue Type: Bug Affects Versions: 0.26.0 Reporter: Benjamin Bannier Attachments: asan.log For a libevent- and ssl-enabled debug build under ubuntu14.04 with clang-3.6 and {{CXXFLAGS=-fsanitize=address}}, the address sanitizer reports a use-after-free from {{ExamplesTest.EventCallFramework}} (log attached). If this is not a false positive, it could lead to all kinds of issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)