[ https://issues.apache.org/jira/browse/MESOS-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960511#comment-13960511 ]
Ian Downes commented on MESOS-1154:
-----------------------------------

I cannot reproduce this yet but did notice that I don't get ZK errors logged for either test; are they expected?

{noformat}
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SlaveRecoveryTest/0, where TypeParam = mesos::internal::slave::MesosContainerizer
[ RUN      ] SlaveRecoveryTest/0.ReconcileKillTask
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0404 22:26:09.061580 52881 exec.cpp:131] Version: 0.19.0
I0404 22:26:09.064916 52906 exec.cpp:205] Executor registered on slave 20140404-222607-1828659978-41600-52820-0
Registered executor on smfd-atr-11-sr1.devel.twitter.com
Starting task 3084d139-253d-435b-9105-e6fdb6ccb01e
sh -c 'sleep 1000'
Forked command at 52924
I0404 22:26:09.391057 52908 exec.cpp:251] Received reconnect request from slave 20140404-222607-1828659978-41600-52820-0
I0404 22:26:09.391815 52911 exec.cpp:228] Executor re-registered on slave 20140404-222607-1828659978-41600-52820-0
Re-registered executor on smfd-atr-11-sr1.devel.twitter.com
Shutting down
Sending SIGTERM to process tree at pid 52924
Killing the following process trees:
[
--- 52924 sleep 1000
]
Command terminated with signal Terminated (pid: 52924)
[       OK ] SlaveRecoveryTest/0.ReconcileKillTask (5196 ms)
[----------] 1 test from SlaveRecoveryTest/0 (5197 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5212 ms total)
[  PASSED  ] 1 test.
YOU HAVE 1 DISABLED TEST
{noformat}

> Flaky SlaveRecoveryTest test: ReconcileKillTask and
> RecoverStatusUpdateManager.
> -------------------------------------------------------------------------------
>
> Key: MESOS-1154
> URL: https://issues.apache.org/jira/browse/MESOS-1154
> Project: Mesos
> Issue Type: Bug
> Components: test
> Reporter: Benjamin Mahler
> Assignee: Ian Downes
>
> Looks like the test tear down is failing to remove a cgroup in both cases:
> {noformat: title=SlaveRecoveryTest/0.ReconcileKillTask}
> [ RUN      ] SlaveRecoveryTest/0.ReconcileKillTask
> 2014-03-27
> 22:32:49,330:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0327 22:32:50.850909 53927 exec.cpp:131] Version: 0.19.0
> I0327 22:32:50.853888 53953 exec.cpp:205] Executor registered on slave
> 20140327-223247-1740121354-49087-44864-0
> Registered executor on smfd-bkq-03-sr4.devel.twitter.com
> Starting task bc4f5f79-088e-4188-b9b5-3585ba5e6a98
> sh -c 'sleep 1000'
> Forked command at 53967
> 2014-03-27
> 22:32:52,667:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2014-03-27
> 22:32:56,004:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> I0327 22:32:56.032625 53957 exec.cpp:251] Received reconnect request from
> slave 20140327-223247-1740121354-49087-44864-0
> I0327 22:32:56.033253 53945 exec.cpp:228] Executor re-registered on slave
> 20140327-223247-1740121354-49087-44864-0
> Re-registered executor on smfd-bkq-03-sr4.devel.twitter.com
> Shutting down
> Sending SIGTERM to process tree at pid 53967
> Killing the following process trees:
> [
> --- 53967 sleep 1000
> ]
> Command terminated with signal Terminated (pid: 53967)
> 2014-03-27
> 22:32:59,341:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> ../../src/tests/mesos.cpp:387: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure():
> 'mesos_test_cd6a76e9-9961-40ba-b0ed-b1190954975f/0b1f90b8-b5a2-43e9-851e-93ca56cb37d0'
> is not a valid cgroup
> [  FAILED  ] SlaveRecoveryTest/0.ReconcileKillTask, where TypeParam =
> mesos::internal::slave::MesosContainerizer (13600 ms)
> {noformat}
> {noformat: title=SlaveRecoveryTest/0.RecoverStatusUpdateManager}
> [ RUN      ] SlaveRecoveryTest/0.RecoverStatusUpdateManager
> 2014-03-27
> 22:30:12,509:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2014-03-27
> 22:30:15,845:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0327 22:30:18.038733 52597 exec.cpp:131] Version: 0.19.0
> I0327 22:30:18.043248 52620 exec.cpp:205] Executor registered on slave
> 20140327-223012-1740121354-49087-44864-0
> Registered executor on smfd-bkq-03-sr4.devel.twitter.com
> Starting task 4c8dda64-17e8-4390-8f31-f878cdde8228
> sh -c 'sleep 1000'
> Forked command at 52637
> 2014-03-27
> 22:30:19,182:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2014-03-27
> 22:30:22,519:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> I0327 22:30:24.087932 52634 exec.cpp:251] Received reconnect request from
> slave 20140327-223012-1740121354-49087-44864-0
> I0327 22:30:24.088851 52635 exec.cpp:228] Executor re-registered on slave
> 20140327-223012-1740121354-49087-44864-0
> Re-registered executor on smfd-bkq-03-sr4.devel.twitter.com
> 2014-03-27
> 22:30:25,855:44864(0x7fa2c5bab940):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:60875] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> I0327 22:30:26.091655 52614 exec.cpp:378] Executor asked to shutdown
> Shutting down
> Sending SIGTERM to process tree at pid 52637
> Killing the following process trees:
> [
> --- 52637 sleep 1000
> ]
> Command terminated with signal Terminated (pid: 52637)
> ../../src/tests/mesos.cpp:387: Failure
> (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to kill tasks in
> nested cgroups: Collect failed:
> 'mesos_test_87c28b05-8055-407a-8ef5-eda7febd1f1c/302e57d9-e837-46cd-bd81-0e39c5b19564'
> is not a valid cgroup
> [  FAILED  ] SlaveRecoveryTest/0.RecoverStatusUpdateManager, where TypeParam
> = mesos::internal::slave::MesosContainerizer (16304 ms)
> {noformat}
> Seems to be flaky and only occurring sometimes.

--
This message was sent by Atlassian JIRA
(v6.2#6252)