[jira] [Updated] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang updated MESOS-4279: -- Shepherd: Timothy Chen > Graceful restart of docker task > --- > > Key: MESOS-4279 > URL: https://issues.apache.org/jira/browse/MESOS-4279 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 >Reporter: Martin Bydzovsky >Assignee: Qian Zhang > > I'm implementing graceful restarts of our mesos-marathon-docker setup and I > came across the following issue: > (it was already discussed on > https://github.com/mesosphere/marathon/issues/2876 and guys from mesosphere > got to the point that it's probably a docker containerizer problem...) > To sum it up: > When I deploy a simple python script to all mesos-slaves: > {code} > #!/usr/bin/python > from time import sleep > import signal > import sys > import datetime > def sigterm_handler(_signo, _stack_frame): > print "got %i" % _signo > print datetime.datetime.now().time() > sys.stdout.flush() > sleep(2) > print datetime.datetime.now().time() > print "ending" > sys.stdout.flush() > sys.exit(0) > signal.signal(signal.SIGTERM, sigterm_handler) > signal.signal(signal.SIGINT, sigterm_handler) > try: > print "Hello" > i = 0 > while True: > i += 1 > print datetime.datetime.now().time() > print "Iteration #%i" % i > sys.stdout.flush() > sleep(1) > finally: > print "Goodbye" > {code} > and I run it through Marathon like > {code:javascript} > data = { > args: ["/tmp/script.py"], > instances: 1, > cpus: 0.1, > mem: 256, > id: "marathon-test-api" > } > {code} > During the app restart I get the expected result - the task receives SIGTERM and > dies peacefully (during my script-specified 2-second period). > But when I wrap this python script in a docker image: > {code} > FROM node:4.2 > RUN mkdir /app > ADD . /app > WORKDIR /app > ENTRYPOINT [] > {code} > and run the appropriate application via Marathon: > {code:javascript} > data = { > args: ["./script.py"], > container: { > type: "DOCKER", > docker: { > image: "bydga/marathon-test-api" > }, > forcePullImage: true > }, > cpus: 0.1, > mem: 256, > instances: 1, > id: "marathon-test-api" > } > {code} > During restart (issued from Marathon), the task dies immediately without > having a chance to do any cleanup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
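A note on reproducing the report above outside of Mesos: {{docker stop}} delivers SIGTERM to the container's PID 1 and escalates to SIGKILL after a timeout, so whether the script's handler ever runs can be checked with plain Docker. A minimal sketch, assuming the reporter's {{bydga/marathon-test-api}} image is available locally and the Docker CLI is on the PATH:

{code}
# Sketch (not part of the report): exercise the shutdown path with
# plain Docker. Assumes the bydga/marathon-test-api image exists.
import subprocess

# Start the script detached; `docker run -d` prints the container id.
container = subprocess.check_output(
    ["docker", "run", "-d", "bydga/marathon-test-api", "./script.py"]
).decode().strip()

# `docker stop` sends SIGTERM, then SIGKILL after 10 seconds.
subprocess.check_call(["docker", "stop", "--time=10", container])

# If the signal reached the script, its handler output ("got 15",
# "ending") shows up in the container logs.
print(subprocess.check_output(["docker", "logs", container]).decode())
{code}

If "got 15" and "ending" never appear in the logs, the signal is stopping at a wrapper process rather than the script, which would be consistent with the behavior described above.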
[jira] [Updated] (MESOS-3746) Consider introducing a mechanism to provide feedback on offer operations
[ https://issues.apache.org/jira/browse/MESOS-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guangya Liu updated MESOS-3746: --- Description: Currently, the master does not provide direct feedback to the framework when an operation is dropped: https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L1703-L1717 A "subsequent offer" is used as the mechanism to determine whether an operation succeeded or not, which is not sufficient if a framework mistakenly sends invalid operations. There should be immediate feedback as to whether the request was "accepted". was: Currently, the master does not provide direct feedback to the framework when an operation is dropped: https://github.com/apache/mesos/blob/master/src/master/master.cpp#L1713-L1715 A "subsequent offer" is used as the mechanism to determine whether an operation succeeded or not, which is not sufficient if a framework mistakenly sends invalid operations. There should be immediate feedback as to whether the request was "accepted". > Consider introducing a mechanism to provide feedback on offer operations > > > Key: MESOS-3746 > URL: https://issues.apache.org/jira/browse/MESOS-3746 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Michael Park > Labels: mesosphere, persistent-volumes, reservations > > Currently, the master does not provide direct feedback to the framework > when an operation is dropped: > https://github.com/apache/mesos/blob/0.26.0/src/master/master.cpp#L1703-L1717 > A "subsequent offer" is used as the mechanism to determine whether an > operation succeeded or not, which is not sufficient if a framework mistakenly > sends invalid operations. There should be immediate feedback as to whether > the request was "accepted". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.
[ https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090360#comment-15090360 ] Shuai Lin commented on MESOS-4258: -- Besides the patch, this would also require a Jenkins admin to configure the locations of the xml files, as described in the "Configuration" section of [Jenkins xUnit Plugin Page|https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin]. Here we have three xml reports: - 3rdparty/libprocess/3rdparty/report.xml - 3rdparty/libprocess/report.xml - src/report.xml > Generate xml test reports in the jenkins build. > --- > > Key: MESOS-4258 > URL: https://issues.apache.org/jira/browse/MESOS-4258 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Benjamin Mahler >Assignee: Shuai Lin > Labels: newbie > > Google test has a flag for generating reports: > {{--gtest_output=xml:report.xml}} > Jenkins can display these reports via the xUnit plugin, which has support for > google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin > This lets us quickly see which test failed, as well as the time that each > test took to run. > We should wire this up. One difficulty is that 'make distclean' complains > because the .xml files are left over (we could update distclean to wipe any > .xml files within the test locations): > {noformat} > ERROR: files left in build directory after distclean: > ./3rdparty/libprocess/3rdparty/report.xml > ./3rdparty/libprocess/report.xml > ./src/report.xml > make[1]: *** [distcleancheck] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
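For reference, the reports the plugin consumes are plain gtest XML: a {{<testsuites>}} tree with per-test names, timings, and {{<failure>}} elements. A small sketch (the file path is just an example) of pulling the same failure/timing data out of one of the reports listed above:

{code}
# Sketch: read a gtest XML report and print failing tests with their
# runtimes, the same data the xUnit plugin renders in Jenkins.
# Assumes a report produced with --gtest_output=xml:report.xml.
import xml.etree.ElementTree as ET

root = ET.parse("src/report.xml").getroot()  # root is <testsuites>
for suite in root.findall("testsuite"):
    for case in suite.findall("testcase"):
        if case.find("failure") is not None:
            print("%s.%s failed (%ss)" % (
                suite.get("name"), case.get("name"), case.get("time")))
{code}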
[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.
[ https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090356#comment-15090356 ] Shuai Lin commented on MESOS-4258: -- Updated the jenkins build script to copy out xml testing reports. https://reviews.apache.org/r/42100/ > Generate xml test reports in the jenkins build. > --- > > Key: MESOS-4258 > URL: https://issues.apache.org/jira/browse/MESOS-4258 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Benjamin Mahler >Assignee: Shuai Lin > Labels: newbie > > Google test has a flag for generating reports: > {{--gtest_output=xml:report.xml}} > Jenkins can display these reports via the xUnit plugin, which has support for > google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin > This lets us quickly see which test failed, as well as the time that each > test took to run. > We should wire this up. One difficulty is that 'make distclean' complains > because the .xml files are left over (we could update distclean to wipe any > .xml files within the test locations): > {noformat} > ERROR: files left in build directory after distclean: > ./3rdparty/libprocess/3rdparty/report.xml > ./3rdparty/libprocess/report.xml > ./src/report.xml > make[1]: *** [distcleancheck] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4301) Accepting an inverse offer prints misleading logs
[ https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090173#comment-15090173 ] Joseph Wu edited comment on MESOS-4301 at 1/9/16 1:57 AM: -- Review to: * Fix the logging. * Fix the bug found above. * Refactor {{Master::accept}} to read more sequentially. https://reviews.apache.org/r/42086/ was (Author: kaysoky): Review to fix the logging and the regression test above: https://reviews.apache.org/r/42086/ > Accepting an inverse offer prints misleading logs > - > > Key: MESOS-4301 > URL: https://issues.apache.org/jira/browse/MESOS-4301 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.25.0 >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: log, maintenance, mesosphere > > Whenever a scheduler accepts an inverse offer, Mesos will print a line like > this in the master logs: > {code} > W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers > '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer > 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid > {code} > Inverse offers should not trigger this warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.
[ https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090327#comment-15090327 ] Shuai Lin commented on MESOS-4258: -- [~bmahler] sure, I'll do that. > Generate xml test reports in the jenkins build. > --- > > Key: MESOS-4258 > URL: https://issues.apache.org/jira/browse/MESOS-4258 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Benjamin Mahler >Assignee: Shuai Lin > Labels: newbie > > Google test has a flag for generating reports: > {{--gtest_output=xml:report.xml}} > Jenkins can display these reports via the xUnit plugin, which has support for > google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin > This lets us quickly see which test failed, as well as the time that each > test took to run. > We should wire this up. One difficulty is that 'make distclean' complains > because the .xml files are left over (we could update distclean to wipe any > .xml files within the test locations): > {noformat} > ERROR: files left in build directory after distclean: > ./3rdparty/libprocess/3rdparty/report.xml > ./3rdparty/libprocess/report.xml > ./src/report.xml > make[1]: *** [distcleancheck] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4249) Mesos fetcher step skipped with MESOS_DOCKER_MESOS_IMAGE flag
[ https://issues.apache.org/jira/browse/MESOS-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuai Lin updated MESOS-4249: - Shepherd: Timothy Chen > Mesos fetcher step skipped with MESOS_DOCKER_MESOS_IMAGE flag > - > > Key: MESOS-4249 > URL: https://issues.apache.org/jira/browse/MESOS-4249 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.26.0 > Environment: mesos 0.26.0-0.2.145.ubuntu1404 >Reporter: Marica Antonacci >Assignee: Shuai Lin > > The following behaviour has been observed using a dockerized mesos slave. > If the slave is running inside a docker container with the docker_mesos_image > startup flag and you submit the deployment of a dockerized application or job > (through Marathon/Chronos), the fetcher step is not performed. On the other > hand, if you request the deployment of a non-dockerized application, the URIs > are correctly fetched. Moreover, if I don’t provide the docker_mesos_image > flag, the fetcher works fine again for both dockerized and non-dockerized > applications. > More details in the user mailing list > (https://www.mail-archive.com/user@mesos.apache.org/msg05429.html). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky
[ https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090311#comment-15090311 ] Greg Mann commented on MESOS-4318: -- Review here: https://reviews.apache.org/r/42096/ > PersistentVolumeTest.BadACLNoPrincipal is flaky > --- > > Key: MESOS-4318 > URL: https://issues.apache.org/jira/browse/MESOS-4318 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Greg Mann > Labels: flaky-test > > https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull > {noformat} > [ RUN ] PersistentVolumeTest.BadACLNoPrincipal > I0108 01:13:16.117883 1325 leveldb.cpp:174] Opened db in 2.614722ms > I0108 01:13:16.118650 1325 leveldb.cpp:181] Compacted db in 706567ns > I0108 01:13:16.118702 1325 leveldb.cpp:196] Created db iterator in 24489ns > I0108 01:13:16.118723 1325 leveldb.cpp:202] Seeked to beginning of db in > 2436ns > I0108 01:13:16.118738 1325 leveldb.cpp:271] Iterated through 0 keys in the > db in 397ns > I0108 01:13:16.118793 1325 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0108 01:13:16.119627 1348 recover.cpp:447] Starting replica recovery > I0108 01:13:16.120352 1348 recover.cpp:473] Replica is in EMPTY status > I0108 01:13:16.121750 1357 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (7084)@172.17.0.2:32801 > I0108 01:13:16.122297 1353 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0108 01:13:16.122747 1350 recover.cpp:564] Updating replica status to > STARTING > I0108 01:13:16.123625 1354 master.cpp:365] Master > 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on > 172.17.0.2:32801 > I0108 01:13:16.123946 1347 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 728242ns > I0108 01:13:16.123999 1347 replica.cpp:320] Persisted replica status to > STARTING > I0108 01:13:16.123708 1354 master.cpp:367] Flags at startup: > --acls="create_volumes { > principals { > values: "test-principal" > } > volume_types { > type: ANY > } > } > create_volumes { > principals { > type: ANY > } > volume_types { > type: NONE > } > } > " --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" > --registry_strict="true" --roles="role1" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" > --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs" > I0108 01:13:16.124219 1354 master.cpp:414] Master allowing unauthenticated > frameworks to register > I0108 01:13:16.124236 1354 master.cpp:417] Master only allowing > authenticated slaves to register > I0108 01:13:16.124248 1354 credentials.hpp:35] Loading credentials for > authentication from '/tmp/f2rA75/credentials' > I0108 01:13:16.124294 1358 recover.cpp:473] Replica is in STARTING 
status > I0108 01:13:16.124644 1354 master.cpp:456] Using default 'crammd5' > authenticator > I0108 01:13:16.124820 1354 master.cpp:493] Authorization enabled > W0108 01:13:16.124843 1354 master.cpp:553] The '--roles' flag is deprecated. > This flag will be removed in the future. See the Mesos 0.27 upgrade notes for > more information > I0108 01:13:16.125154 1348 hierarchical.cpp:147] Initialized hierarchical > allocator process > I0108 01:13:16.125334 1345 whitelist_watcher.cpp:77] No whitelist given > I0108 01:13:16.126065 1346 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (7085)@172.17.0.2:32801 > I0108 01:13:16.126806 1348 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0108 01:13:16.128237 1354 recover.cpp:564] Updating replica status to VOTING > I0108 01:13:16.128402 1359 master.cpp:1629] The newly elected leader is > master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 > I0108 01:13:16.128489 1359 master.cpp:1642] Elected as the leading master! > I0108 01:13:16.128523 1359 master.cpp:1387] Recovering from registrar > I0108 01:13:16.128756 1355 registr
[jira] [Commented] (MESOS-4258) Generate xml test reports in the jenkins build.
[ https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090303#comment-15090303 ] Benjamin Mahler commented on MESOS-4258: Your patch is committed, so now the report files are generated. The next part is to process the reports in jenkins. I think we'll want to use '[docker cp|https://docs.docker.com/engine/reference/commandline/cp/]' to copy out the report files from the container to the jenkins workspace. This likely means removing {{--rm}} from our {{docker run}} invocation and placing the rm command within the EXIT trap. [~lins05] can you do this next part as well? > Generate xml test reports in the jenkins build. > --- > > Key: MESOS-4258 > URL: https://issues.apache.org/jira/browse/MESOS-4258 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Benjamin Mahler >Assignee: Shuai Lin > Labels: newbie > > Google test has a flag for generating reports: > {{--gtest_output=xml:report.xml}} > Jenkins can display these reports via the xUnit plugin, which has support for > google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin > This lets us quickly see which test failed, as well as the time that each > test took to run. > We should wire this up. One difficulty is that 'make distclean' complains > because the .xml files are left over (we could update distclean to wipe any > .xml files within the test locations): > {noformat} > ERROR: files left in build directory after distclean: > ./3rdparty/libprocess/3rdparty/report.xml > ./3rdparty/libprocess/report.xml > ./src/report.xml > make[1]: *** [distcleancheck] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
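The flow described above could look roughly like the sketch below; the container name, image, and in-container paths are made up for illustration, since the real values live in the Jenkins build script:

{code}
# Sketch of the proposed cleanup flow (names and paths hypothetical).
import subprocess

CONTAINER = "mesos-reviewbot"  # hypothetical; the real script picks a name
try:
    # Run the build *without* --rm so the container outlives the command.
    subprocess.check_call(["docker", "run", "--name", CONTAINER,
                           "mesos-build-image", "make", "check"])
finally:
    for report in ["src/report.xml",
                   "3rdparty/libprocess/report.xml",
                   "3rdparty/libprocess/3rdparty/report.xml"]:
        # Flatten the path so the three report.xml files don't clobber
        # each other in the workspace.
        dest = report.replace("/", "_")
        subprocess.call(["docker", "cp",
                         CONTAINER + ":/mesos/build/" + report, dest])
    # This replaces what --rm used to do (the "EXIT trap" in bash terms).
    subprocess.call(["docker", "rm", "-f", CONTAINER])
{code}

Running {{docker rm}} in the cleanup path replaces what {{--rm}} did while letting {{docker cp}} retrieve the reports first; {{docker cp}} also works on stopped containers, so the reports survive a failed {{make check}}.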
[jira] [Commented] (MESOS-3472) RegistryTokenTest.ExpiredToken test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090226#comment-15090226 ] Neil Conway commented on MESOS-3472: Good point -- some initial experiments seem to confirm that moving the clock backwards in {{MesosTest::TearDown}} will not be trivial. I guess we should do #1 or #2 for now. > RegistryTokenTest.ExpiredToken test is flaky > > > Key: MESOS-3472 > URL: https://issues.apache.org/jira/browse/MESOS-3472 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan >Assignee: Neil Conway > Labels: flaky, mesosphere > > RegistryTokenTest.ExpiredToken test is flaky. Here is the error I got on OSX > after running it for several times: > {noformat} > [ RUN ] RegistryTokenTest.ExpiredToken > ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure > Value of: token.isError() > Actual: false > Expected: true > libc++abi.dylib: terminating with uncaught exception of type > testing::internal::GoogleTestFailureException: > ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure > Value of: token.isError() > Actual: false > Expected: true > *** Aborted at 1442708631 (unix time) try "date -d @1442708631" if you are > using GNU date *** > PC: @ 0x7fff925fd286 __pthread_kill > *** SIGABRT (@0x7fff925fd286) received by PID 7082 (TID 0x7fff7d7ad300) stack > trace: *** > @ 0x7fff9041af1a _sigtramp > @ 0x7fff59759968 (unknown) > @ 0x7fff9bb429b3 abort > @ 0x7fff90ce1a21 abort_message > @ 0x7fff90d099b9 default_terminate_handler() > @ 0x7fff994767eb _objc_terminate() > @ 0x7fff90d070a1 std::__terminate() > @ 0x7fff90d06d48 __cxa_rethrow > @0x10781bb16 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @0x1077e9d30 testing::UnitTest::Run() > @0x106d59a91 RUN_ALL_TESTS() > @0x106d55d47 main > @ 0x7fff8fc395c9 start > @0x3 (unknown) > Abort trap: 6 > ~/src/mesos/build ((3ee82e3...)) $ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3472) RegistryTokenTest.ExpiredToken test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090209#comment-15090209 ] Joseph Wu commented on MESOS-3472: -- \#4 may be difficult to do without some weird side-effects. I implemented some clock cleanup here: [MESOS-3882]. But rolling back the clock could potentially leave some timers active. Ideally, we'd want to clean up all of libprocess's state in between tests that reset the clock. You can look through some of the reviews that are out for doing just that: [MESOS-3820]. > RegistryTokenTest.ExpiredToken test is flaky > > > Key: MESOS-3472 > URL: https://issues.apache.org/jira/browse/MESOS-3472 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan >Assignee: Neil Conway > Labels: flaky, mesosphere > > RegistryTokenTest.ExpiredToken test is flaky. Here is the error I got on OSX > after running it for several times: > {noformat} > [ RUN ] RegistryTokenTest.ExpiredToken > ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure > Value of: token.isError() > Actual: false > Expected: true > libc++abi.dylib: terminating with uncaught exception of type > testing::internal::GoogleTestFailureException: > ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure > Value of: token.isError() > Actual: false > Expected: true > *** Aborted at 1442708631 (unix time) try "date -d @1442708631" if you are > using GNU date *** > PC: @ 0x7fff925fd286 __pthread_kill > *** SIGABRT (@0x7fff925fd286) received by PID 7082 (TID 0x7fff7d7ad300) stack > trace: *** > @ 0x7fff9041af1a _sigtramp > @ 0x7fff59759968 (unknown) > @ 0x7fff9bb429b3 abort > @ 0x7fff90ce1a21 abort_message > @ 0x7fff90d099b9 default_terminate_handler() > @ 0x7fff994767eb _objc_terminate() > @ 0x7fff90d070a1 std::__terminate() > @ 0x7fff90d06d48 __cxa_rethrow > @0x10781bb16 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @0x1077e9d30 testing::UnitTest::Run() > @0x106d59a91 RUN_ALL_TESTS() > @0x106d55d47 main > @ 0x7fff8fc395c9 start > @0x3 (unknown) > Abort trap: 6 > ~/src/mesos/build ((3ee82e3...)) $ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4229) Docker containers left running on disk after reviewbot builds
[ https://issues.apache.org/jira/browse/MESOS-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090192#comment-15090192 ] Greg Mann commented on MESOS-4229: -- The issue may be due to a hung build. Jenkins should kill such a build after a specified period, but in that case perhaps Docker cleanup doesn't occur as it normally would. > Docker containers left running on disk after reviewbot builds > - > > Key: MESOS-4229 > URL: https://issues.apache.org/jira/browse/MESOS-4229 > Project: Mesos > Issue Type: Bug > Environment: ASF Mesos Reviewbot >Reporter: Greg Mann > Labels: build, mesosphere, test > > The Mesos Reviewbot builds recently failed due to Docker containers being > left running on the disk, eventually leading to a full disk: > https://issues.apache.org/jira/browse/INFRA-10984 > These containers should be automatically cleaned up to avoid this problem in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4229) Docker containers left running on disk after reviewbot builds
[ https://issues.apache.org/jira/browse/MESOS-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090180#comment-15090180 ] Jojy Varghese commented on MESOS-4229: -- We use *--rm* flag when launching *docker run*. I thought that was sufficient for cleanups. > Docker containers left running on disk after reviewbot builds > - > > Key: MESOS-4229 > URL: https://issues.apache.org/jira/browse/MESOS-4229 > Project: Mesos > Issue Type: Bug > Environment: ASF Mesos Reviewbot >Reporter: Greg Mann > Labels: build, mesosphere, test > > The Mesos Reviewbot builds recently failed due to Docker containers being > left running on the disk, eventually leading to a full disk: > https://issues.apache.org/jira/browse/INFRA-10984 > These containers should be automatically cleaned up to avoid this problem in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4301) Accepting an inverse offer prints misleading logs
[ https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090173#comment-15090173 ] Joseph Wu commented on MESOS-4301: -- Review to fix the logging and the regression test above: https://reviews.apache.org/r/42086/ > Accepting an inverse offer prints misleading logs > - > > Key: MESOS-4301 > URL: https://issues.apache.org/jira/browse/MESOS-4301 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.25.0 >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: log, maintenance, mesosphere > > Whenever a scheduler accepts an inverse offer, Mesos will print a line like > this in the master logs: > {code} > W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers > '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer > 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid > {code} > Inverse offers should not trigger this warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4289) Design doc for simple appc image discovery
[ https://issues.apache.org/jira/browse/MESOS-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087847#comment-15087847 ] Jojy Varghese edited comment on MESOS-4289 at 1/8/16 11:25 PM: --- https://docs.google.com/document/d/1EeL4JApd2-cW6p3xdBatOc9foT3W3E5atQLJ2iVj5Ow/edit?usp=sharing was (Author: jojy): https://docs.google.com/document/d/1EeL4JApd2-cW6p3xdBatOc9foT3W3E5atQLJ2iVj5Ow/edit#heading=h.xof8uidxnjzv > Design doc for simple appc image discovery > -- > > Key: MESOS-4289 > URL: https://issues.apache.org/jira/browse/MESOS-4289 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Jojy Varghese >Assignee: Jojy Varghese > Labels: mesosphere > > Create a design document describing the following: > - Model and abstraction of the Discoverer > - Workflow of the discovery process -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3421) Support sharing of resources across task instances
[ https://issues.apache.org/jira/browse/MESOS-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-3421: -- Shepherd: Adam B I'll volunteer to shepherd this. cc: [~anandmazumdar] who wanted to help review/implement. > Support sharing of resources across task instances > -- > > Key: MESOS-3421 > URL: https://issues.apache.org/jira/browse/MESOS-3421 > Project: Mesos > Issue Type: Improvement > Components: general >Affects Versions: 0.23.0 >Reporter: Anindya Sinha >Assignee: Anindya Sinha > Labels: external-volumes, persistent-volumes > > A service that needs a persistent volume needs access to the same > persistent volume (RW) from multiple task instances on the same agent > node. Currently, a persistent volume once offered to the framework(s) can be > scheduled to a task and until that task terminates, that persistent volume > cannot be used by another task. > Explore providing the capability of sharing persistent volumes across task > instances scheduled on a single agent node. > Based on discussion within the community, we would allow sharing of resources > in general, and add support to enable shareability for persistent volumes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4301) Accepting an inverse offer prints misleading logs
[ https://issues.apache.org/jira/browse/MESOS-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090140#comment-15090140 ] Joseph Wu commented on MESOS-4301: -- While fixing this log line, I found another bug. Essentially: # {{validation::offer::validate}} returns an error when an {{InverseOffer}} is accepted. # If an {{Offer}} is part of the same {{Call::ACCEPT}}, the master sees {{error.isSome()}} and returns a {{TASK_LOST}} for normal offers. (https://github.com/apache/mesos/blob/fafbdca610d0a150b9fa9cb62d1c63cb7a6fdaf3/src/master/master.cpp#L3117) Regression test: https://reviews.apache.org/r/42092/ > Accepting an inverse offer prints misleading logs > - > > Key: MESOS-4301 > URL: https://issues.apache.org/jira/browse/MESOS-4301 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 0.25.0 >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: log, maintenance, mesosphere > > Whenever a scheduler accepts an inverse offer, Mesos will print a line like > this in the master logs: > {code} > W1125 10:05:53.155109 29362 master.cpp:2897] ACCEPT call used invalid offers > '[ 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 ]': Offer > 932f7d7b-f2d4-42c7-9391-222c19b9d35b-O2 is no longer valid > {code} > Inverse offers should not trigger this warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
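In abstract terms, the bug pattern is that regular and inverse offers are validated in a single pass, so the error produced for the inverse offer poisons the whole ACCEPT call. A toy model (illustrative names, not the master's actual code):

{code}
# Toy model of the described bug; all names are illustrative.
def validate_accept(offer_ids, offers, inverse_offers):
    # Bug in a nutshell: validation never consults inverse_offers, so
    # an accepted inverse offer is reported as "no longer valid".
    for oid in offer_ids:
        if oid not in offers:
            return "Offer %s is no longer valid" % oid
    return None

error = validate_accept(["O1", "O2"], offers={"O1"}, inverse_offers={"O2"})
if error is not None:
    # The master then declares TASK_LOST for tasks on the valid O1 too.
    print("ACCEPT call used invalid offers: " + error)
{code}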
[jira] [Commented] (MESOS-3472) RegistryTokenTest.ExpiredToken test is flaky
[ https://issues.apache.org/jira/browse/MESOS-3472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090037#comment-15090037 ] Neil Conway commented on MESOS-3472: Weird: Clock::now.secs of 1483973693 is a time in 2017. Looking into this, the problem seems to be that every time we run the test suite, we {{Clock::advance}} by about 4 weeks. So if you run the entire test suite with {{gtest_repeat}} set to ~12 or more, we'll eventually move the clock forward one year, which means the token we create in {{RegistryTokenTest.ExpiredToken}} will no longer be expired and the test will fail. Possible fixes: 1. Have {{RegistryTokenTest.ExpiredToken}} use an offset of more than 1 year. Obviously this is kludgy. 2. Have {{RegistryTokenTest.ExpiredToken}} use a fixed time in the past, rather than picking one relative to {{Clock::now}}. Again, somewhat kludgy, although better than #1. 3. Have {{MesosTest::TearDown}} reset the clock (via {{Clock::update}}) to some "initial" value. Right now we don't capture an appropriate initial value, however. 4. Introduce {{Clock::resetAdvance()}} which clears the effect of any {{Clock::advance}} calls, and then invoke this in {{MesosTest::TearDown}}. I'm inclined to do #4. > RegistryTokenTest.ExpiredToken test is flaky > > > Key: MESOS-3472 > URL: https://issues.apache.org/jira/browse/MESOS-3472 > Project: Mesos > Issue Type: Bug >Reporter: Artem Harutyunyan >Assignee: Neil Conway > Labels: flaky, mesosphere > > RegistryTokenTest.ExpiredToken test is flaky. Here is the error I got on OSX > after running it for several times: > {noformat} > [ RUN ] RegistryTokenTest.ExpiredToken > ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure > Value of: token.isError() > Actual: false > Expected: true > libc++abi.dylib: terminating with uncaught exception of type > testing::internal::GoogleTestFailureException: > ../../src/tests/containerizer/provisioner_docker_tests.cpp:167: Failure > Value of: token.isError() > Actual: false > Expected: true > *** Aborted at 1442708631 (unix time) try "date -d @1442708631" if you are > using GNU date *** > PC: @ 0x7fff925fd286 __pthread_kill > *** SIGABRT (@0x7fff925fd286) received by PID 7082 (TID 0x7fff7d7ad300) stack > trace: *** > @ 0x7fff9041af1a _sigtramp > @ 0x7fff59759968 (unknown) > @ 0x7fff9bb429b3 abort > @ 0x7fff90ce1a21 abort_message > @ 0x7fff90d099b9 default_terminate_handler() > @ 0x7fff994767eb _objc_terminate() > @ 0x7fff90d070a1 std::__terminate() > @ 0x7fff90d06d48 __cxa_rethrow > @0x10781bb16 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @0x1077e9d30 testing::UnitTest::Run() > @0x106d59a91 RUN_ALL_TESTS() > @0x106d55d47 main > @ 0x7fff8fc395c9 start > @0x3 (unknown) > Abort trap: 6 > ~/src/mesos/build ((3ee82e3...)) $ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
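A toy model of the mechanism (pure illustration, not the libprocess clock API): the token's expiry is stamped relative to the advanced virtual clock, while the expiry check effectively compares against a clock that has not drifted, so enough accumulated {{Clock::advance}} drift pushes the "expired" timestamp past the comparison point:

{code}
# Toy model of the flake; the numbers approximate the description above.
WEEK = 7 * 24 * 3600
YEAR = 365 * 24 * 3600

real_now = 0  # undrifted reference time, held fixed for the illustration
for repeats in range(1, 20):
    virtual_now = real_now + repeats * 4 * WEEK  # ~4 weeks drift per run
    token_expiry = virtual_now - YEAR            # meant to be "expired"
    if token_expiry >= real_now:                 # check against real time
        print("token no longer expired after %d repeats" % repeats)
        break
{code}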
[jira] [Commented] (MESOS-3003) Support mounting in default configuration files/volumes into every new container
[ https://issues.apache.org/jira/browse/MESOS-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089979#comment-15089979 ] Timothy Chen commented on MESOS-3003: - I think following what libcontainer/runc does, we should create a list of /etc files to mount (i.e. /etc/hosts and /etc/resolv.conf) in the container when we see that /etc is not already mounted from the host. For now I think this should suffice, and we need to test different containers to see whether there are any more configuration files that we need to pass in. > Support mounting in default configuration files/volumes into every new > container > > > Key: MESOS-3003 > URL: https://issues.apache.org/jira/browse/MESOS-3003 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Timothy Chen > Labels: mesosphere, unified-containerizer-mvp > > Most container images leave out system configuration (e.g. /etc/*) and expect > the container runtimes to mount in specific configuration files, such as > /etc/resolv.conf, from the host into the container as needed. > We need to support mounting in specific configuration files for the command > executor to work, and also allow the user to optionally define other > configuration files to mount in as well via flags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
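A rough sketch of that decision, with paths and structure purely illustrative (the real logic would live in the containerizer):

{code}
import os

# Files that libcontainer/runc-style runtimes commonly provide by default.
DEFAULT_ETC_FILES = ["/etc/hosts", "/etc/resolv.conf", "/etc/hostname"]

def default_etc_mounts(container_mounts):
    # container_mounts: destination paths the user already mounts from
    # the host. If /etc itself comes from the host, add nothing.
    if "/etc" in container_mounts:
        return []
    # Otherwise bind each existing host file to the same path inside.
    return [(path, path) for path in DEFAULT_ETC_FILES if os.path.exists(path)]
{code}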
[jira] [Updated] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky
[ https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4318: -- Labels: flaky-test (was: ) > PersistentVolumeTest.BadACLNoPrincipal is flaky > --- > > Key: MESOS-4318 > URL: https://issues.apache.org/jira/browse/MESOS-4318 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu > Labels: flaky-test > > https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull > {noformat} > [ RUN ] PersistentVolumeTest.BadACLNoPrincipal > I0108 01:13:16.117883 1325 leveldb.cpp:174] Opened db in 2.614722ms > I0108 01:13:16.118650 1325 leveldb.cpp:181] Compacted db in 706567ns > I0108 01:13:16.118702 1325 leveldb.cpp:196] Created db iterator in 24489ns > I0108 01:13:16.118723 1325 leveldb.cpp:202] Seeked to beginning of db in > 2436ns > I0108 01:13:16.118738 1325 leveldb.cpp:271] Iterated through 0 keys in the > db in 397ns > I0108 01:13:16.118793 1325 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0108 01:13:16.119627 1348 recover.cpp:447] Starting replica recovery > I0108 01:13:16.120352 1348 recover.cpp:473] Replica is in EMPTY status > I0108 01:13:16.121750 1357 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (7084)@172.17.0.2:32801 > I0108 01:13:16.122297 1353 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0108 01:13:16.122747 1350 recover.cpp:564] Updating replica status to > STARTING > I0108 01:13:16.123625 1354 master.cpp:365] Master > 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on > 172.17.0.2:32801 > I0108 01:13:16.123946 1347 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 728242ns > I0108 01:13:16.123999 1347 replica.cpp:320] Persisted replica status to > STARTING > I0108 01:13:16.123708 1354 master.cpp:367] Flags at startup: > --acls="create_volumes { > principals { > values: "test-principal" > } > volume_types { > type: ANY > } > } > create_volumes { > principals { > type: ANY > } > volume_types { > type: NONE > } > } > " --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" > --registry_strict="true" --roles="role1" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" > --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs" > I0108 01:13:16.124219 1354 master.cpp:414] Master allowing unauthenticated > frameworks to register > I0108 01:13:16.124236 1354 master.cpp:417] Master only allowing > authenticated slaves to register > I0108 01:13:16.124248 1354 credentials.hpp:35] Loading credentials for > authentication from '/tmp/f2rA75/credentials' > I0108 01:13:16.124294 1358 recover.cpp:473] Replica is in STARTING status > I0108 01:13:16.124644 1354 master.cpp:456] Using default 'crammd5' > authenticator > I0108 
01:13:16.124820 1354 master.cpp:493] Authorization enabled > W0108 01:13:16.124843 1354 master.cpp:553] The '--roles' flag is deprecated. > This flag will be removed in the future. See the Mesos 0.27 upgrade notes for > more information > I0108 01:13:16.125154 1348 hierarchical.cpp:147] Initialized hierarchical > allocator process > I0108 01:13:16.125334 1345 whitelist_watcher.cpp:77] No whitelist given > I0108 01:13:16.126065 1346 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (7085)@172.17.0.2:32801 > I0108 01:13:16.126806 1348 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0108 01:13:16.128237 1354 recover.cpp:564] Updating replica status to VOTING > I0108 01:13:16.128402 1359 master.cpp:1629] The newly elected leader is > master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 > I0108 01:13:16.128489 1359 master.cpp:1642] Elected as the leading master! > I0108 01:13:16.128523 1359 master.cpp:1387] Recovering from registrar > I0108 01:13:16.128756 1355 registrar.cpp:307] Recovering registrar > I0108 01:13:16.129259 1344 leveldb.cpp:304] Persisting metadata (8 bytes) to
[jira] [Updated] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky
[ https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4318: -- Shepherd: Jie Yu Sprint: Mesosphere Sprint 26 > PersistentVolumeTest.BadACLNoPrincipal is flaky > --- > > Key: MESOS-4318 > URL: https://issues.apache.org/jira/browse/MESOS-4318 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Greg Mann > Labels: flaky-test > > https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull > {noformat} > [ RUN ] PersistentVolumeTest.BadACLNoPrincipal > I0108 01:13:16.117883 1325 leveldb.cpp:174] Opened db in 2.614722ms > I0108 01:13:16.118650 1325 leveldb.cpp:181] Compacted db in 706567ns > I0108 01:13:16.118702 1325 leveldb.cpp:196] Created db iterator in 24489ns > I0108 01:13:16.118723 1325 leveldb.cpp:202] Seeked to beginning of db in > 2436ns > I0108 01:13:16.118738 1325 leveldb.cpp:271] Iterated through 0 keys in the > db in 397ns > I0108 01:13:16.118793 1325 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0108 01:13:16.119627 1348 recover.cpp:447] Starting replica recovery > I0108 01:13:16.120352 1348 recover.cpp:473] Replica is in EMPTY status > I0108 01:13:16.121750 1357 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (7084)@172.17.0.2:32801 > I0108 01:13:16.122297 1353 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0108 01:13:16.122747 1350 recover.cpp:564] Updating replica status to > STARTING > I0108 01:13:16.123625 1354 master.cpp:365] Master > 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on > 172.17.0.2:32801 > I0108 01:13:16.123946 1347 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 728242ns > I0108 01:13:16.123999 1347 replica.cpp:320] Persisted replica status to > STARTING > I0108 01:13:16.123708 1354 master.cpp:367] Flags at startup: > --acls="create_volumes { > principals { > values: "test-principal" > } > volume_types { > type: ANY > } > } > create_volumes { > principals { > type: ANY > } > volume_types { > type: NONE > } > } > " --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" > --registry_strict="true" --roles="role1" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" > --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs" > I0108 01:13:16.124219 1354 master.cpp:414] Master allowing unauthenticated > frameworks to register > I0108 01:13:16.124236 1354 master.cpp:417] Master only allowing > authenticated slaves to register > I0108 01:13:16.124248 1354 credentials.hpp:35] Loading credentials for > authentication from '/tmp/f2rA75/credentials' > I0108 01:13:16.124294 1358 recover.cpp:473] Replica is in STARTING status > I0108 01:13:16.124644 1354 master.cpp:456] Using 
default 'crammd5' > authenticator > I0108 01:13:16.124820 1354 master.cpp:493] Authorization enabled > W0108 01:13:16.124843 1354 master.cpp:553] The '--roles' flag is deprecated. > This flag will be removed in the future. See the Mesos 0.27 upgrade notes for > more information > I0108 01:13:16.125154 1348 hierarchical.cpp:147] Initialized hierarchical > allocator process > I0108 01:13:16.125334 1345 whitelist_watcher.cpp:77] No whitelist given > I0108 01:13:16.126065 1346 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (7085)@172.17.0.2:32801 > I0108 01:13:16.126806 1348 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0108 01:13:16.128237 1354 recover.cpp:564] Updating replica status to VOTING > I0108 01:13:16.128402 1359 master.cpp:1629] The newly elected leader is > master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 > I0108 01:13:16.128489 1359 master.cpp:1642] Elected as the leading master! > I0108 01:13:16.128523 1359 master.cpp:1387] Recovering from registrar > I0108 01:13:16.128756 1355 registrar.cpp:307] Recovering registrar > I0108 01:13:16.129259
[jira] [Updated] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky
[ https://issues.apache.org/jira/browse/MESOS-4318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-4318: -- Assignee: Greg Mann > PersistentVolumeTest.BadACLNoPrincipal is flaky > --- > > Key: MESOS-4318 > URL: https://issues.apache.org/jira/browse/MESOS-4318 > Project: Mesos > Issue Type: Bug >Reporter: Jie Yu >Assignee: Greg Mann > Labels: flaky-test > > https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull > {noformat} > [ RUN ] PersistentVolumeTest.BadACLNoPrincipal > I0108 01:13:16.117883 1325 leveldb.cpp:174] Opened db in 2.614722ms > I0108 01:13:16.118650 1325 leveldb.cpp:181] Compacted db in 706567ns > I0108 01:13:16.118702 1325 leveldb.cpp:196] Created db iterator in 24489ns > I0108 01:13:16.118723 1325 leveldb.cpp:202] Seeked to beginning of db in > 2436ns > I0108 01:13:16.118738 1325 leveldb.cpp:271] Iterated through 0 keys in the > db in 397ns > I0108 01:13:16.118793 1325 replica.cpp:779] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0108 01:13:16.119627 1348 recover.cpp:447] Starting replica recovery > I0108 01:13:16.120352 1348 recover.cpp:473] Replica is in EMPTY status > I0108 01:13:16.121750 1357 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from (7084)@172.17.0.2:32801 > I0108 01:13:16.122297 1353 recover.cpp:193] Received a recover response from > a replica in EMPTY status > I0108 01:13:16.122747 1350 recover.cpp:564] Updating replica status to > STARTING > I0108 01:13:16.123625 1354 master.cpp:365] Master > 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on > 172.17.0.2:32801 > I0108 01:13:16.123946 1347 leveldb.cpp:304] Persisting metadata (8 bytes) to > leveldb took 728242ns > I0108 01:13:16.123999 1347 replica.cpp:320] Persisted replica status to > STARTING > I0108 01:13:16.123708 1354 master.cpp:367] Flags at startup: > --acls="create_volumes { > principals { > values: "test-principal" > } > volume_types { > type: ANY > } > } > create_volumes { > principals { > type: ANY > } > volume_types { > type: NONE > } > } > " --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate="false" --authenticate_slaves="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_slave_ping_timeouts="5" --quiet="false" > --recovery_slave_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" > --registry_strict="true" --roles="role1" --root_submissions="true" > --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" > --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" > --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs" > I0108 01:13:16.124219 1354 master.cpp:414] Master allowing unauthenticated > frameworks to register > I0108 01:13:16.124236 1354 master.cpp:417] Master only allowing > authenticated slaves to register > I0108 01:13:16.124248 1354 credentials.hpp:35] Loading credentials for > authentication from '/tmp/f2rA75/credentials' > I0108 01:13:16.124294 1358 recover.cpp:473] Replica is in STARTING status > I0108 01:13:16.124644 1354 master.cpp:456] Using default 'crammd5' > 
authenticator > I0108 01:13:16.124820 1354 master.cpp:493] Authorization enabled > W0108 01:13:16.124843 1354 master.cpp:553] The '--roles' flag is deprecated. > This flag will be removed in the future. See the Mesos 0.27 upgrade notes for > more information > I0108 01:13:16.125154 1348 hierarchical.cpp:147] Initialized hierarchical > allocator process > I0108 01:13:16.125334 1345 whitelist_watcher.cpp:77] No whitelist given > I0108 01:13:16.126065 1346 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from (7085)@172.17.0.2:32801 > I0108 01:13:16.126806 1348 recover.cpp:193] Received a recover response from > a replica in STARTING status > I0108 01:13:16.128237 1354 recover.cpp:564] Updating replica status to VOTING > I0108 01:13:16.128402 1359 master.cpp:1629] The newly elected leader is > master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 > I0108 01:13:16.128489 1359 master.cpp:1642] Elected as the leading master! > I0108 01:13:16.128523 1359 master.cpp:1387] Recovering from registrar > I0108 01:13:16.128756 1355 registrar.cpp:307] Recovering registrar > I0108 01:13:16.129259 1344 leveldb.cpp:304] Persistin
[jira] [Commented] (MESOS-4229) Docker containers left running on disk after reviewbot builds
[ https://issues.apache.org/jira/browse/MESOS-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089973#comment-15089973 ] Adam B commented on MESOS-4229: --- May have been introduced by MESOS-3900? cc: [~jojy] [~vinodkone] > Docker containers left running on disk after reviewbot builds > - > > Key: MESOS-4229 > URL: https://issues.apache.org/jira/browse/MESOS-4229 > Project: Mesos > Issue Type: Bug > Environment: ASF Mesos Reviewbot >Reporter: Greg Mann > Labels: build, mesosphere, test > > The Mesos Reviewbot builds recently failed due to Docker containers being > left running on the disk, eventually leading to a full disk: > https://issues.apache.org/jira/browse/INFRA-10984 > These containers should be automatically cleaned up to avoid this problem in > the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4318) PersistentVolumeTest.BadACLNoPrincipal is flaky
Jie Yu created MESOS-4318: - Summary: PersistentVolumeTest.BadACLNoPrincipal is flaky Key: MESOS-4318 URL: https://issues.apache.org/jira/browse/MESOS-4318 Project: Mesos Issue Type: Bug Reporter: Jie Yu https://builds.apache.org/job/Mesos/1457/COMPILER=gcc,CONFIGURATION=--verbose%20--enable-libevent%20--enable-ssl,OS=centos:7,label_exp=docker%7C%7CHadoop/consoleFull {noformat} [ RUN ] PersistentVolumeTest.BadACLNoPrincipal I0108 01:13:16.117883 1325 leveldb.cpp:174] Opened db in 2.614722ms I0108 01:13:16.118650 1325 leveldb.cpp:181] Compacted db in 706567ns I0108 01:13:16.118702 1325 leveldb.cpp:196] Created db iterator in 24489ns I0108 01:13:16.118723 1325 leveldb.cpp:202] Seeked to beginning of db in 2436ns I0108 01:13:16.118738 1325 leveldb.cpp:271] Iterated through 0 keys in the db in 397ns I0108 01:13:16.118793 1325 replica.cpp:779] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned I0108 01:13:16.119627 1348 recover.cpp:447] Starting replica recovery I0108 01:13:16.120352 1348 recover.cpp:473] Replica is in EMPTY status I0108 01:13:16.121750 1357 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (7084)@172.17.0.2:32801 I0108 01:13:16.122297 1353 recover.cpp:193] Received a recover response from a replica in EMPTY status I0108 01:13:16.122747 1350 recover.cpp:564] Updating replica status to STARTING I0108 01:13:16.123625 1354 master.cpp:365] Master 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 (d9632dd1c41e) started on 172.17.0.2:32801 I0108 01:13:16.123946 1347 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 728242ns I0108 01:13:16.123999 1347 replica.cpp:320] Persisted replica status to STARTING I0108 01:13:16.123708 1354 master.cpp:367] Flags at startup: --acls="create_volumes { principals { values: "test-principal" } volume_types { type: ANY } } create_volumes { principals { type: ANY } volume_types { type: NONE } } " --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/f2rA75/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --roles="role1" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/mesos/mesos-0.27.0/_inst/share/mesos/webui" --work_dir="/tmp/f2rA75/master" --zk_session_timeout="10secs" I0108 01:13:16.124219 1354 master.cpp:414] Master allowing unauthenticated frameworks to register I0108 01:13:16.124236 1354 master.cpp:417] Master only allowing authenticated slaves to register I0108 01:13:16.124248 1354 credentials.hpp:35] Loading credentials for authentication from '/tmp/f2rA75/credentials' I0108 01:13:16.124294 1358 recover.cpp:473] Replica is in STARTING status I0108 01:13:16.124644 1354 master.cpp:456] Using default 'crammd5' authenticator I0108 01:13:16.124820 1354 master.cpp:493] Authorization enabled W0108 01:13:16.124843 1354 master.cpp:553] The '--roles' flag is deprecated. This flag will be removed in the future. 
See the Mesos 0.27 upgrade notes for more information I0108 01:13:16.125154 1348 hierarchical.cpp:147] Initialized hierarchical allocator process I0108 01:13:16.125334 1345 whitelist_watcher.cpp:77] No whitelist given I0108 01:13:16.126065 1346 replica.cpp:673] Replica in STARTING status received a broadcasted recover request from (7085)@172.17.0.2:32801 I0108 01:13:16.126806 1348 recover.cpp:193] Received a recover response from a replica in STARTING status I0108 01:13:16.128237 1354 recover.cpp:564] Updating replica status to VOTING I0108 01:13:16.128402 1359 master.cpp:1629] The newly elected leader is master@172.17.0.2:32801 with id 773d31e8-383d-4e4b-aa68-f9a3fb9f1fc2 I0108 01:13:16.128489 1359 master.cpp:1642] Elected as the leading master! I0108 01:13:16.128523 1359 master.cpp:1387] Recovering from registrar I0108 01:13:16.128756 1355 registrar.cpp:307] Recovering registrar I0108 01:13:16.129259 1344 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 531437ns I0108 01:13:16.129292 1344 replica.cpp:320] Persisted replica status to VOTING I0108 01:13:16.129425 1358 recover.cpp:578] Successfully joined the Paxos group I0108 01:13:16.129680 1358 recover.cpp:462] Recover process terminated I0108 01:13:16.130187 1358 log.cpp:659] Attempting to start the writer I0108 01:13:16.131613 1352 replica.cpp:493] Replica received implicit pro
[jira] [Updated] (MESOS-4258) Generate xml test reports in the jenkins build.
[ https://issues.apache.org/jira/browse/MESOS-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-4258: --- Shepherd: Benjamin Mahler > Generate xml test reports in the jenkins build. > --- > > Key: MESOS-4258 > URL: https://issues.apache.org/jira/browse/MESOS-4258 > Project: Mesos > Issue Type: Task > Components: test >Reporter: Benjamin Mahler >Assignee: Shuai Lin > Labels: newbie > > Google test has a flag for generating reports: > {{--gtest_output=xml:report.xml}} > Jenkins can display these reports via the xUnit plugin, which has support for > google test xml: https://wiki.jenkins-ci.org/display/JENKINS/xUnit+Plugin > This lets us quickly see which test failed, as well as the time that each > test took to run. > We should wire this up. One difficulty is that 'make distclean' complains > because the .xml files are left over (we could update distclean to wipe any > .xml files within the test locations): > {noformat} > ERROR: files left in build directory after distclean: > ./3rdparty/libprocess/3rdparty/report.xml > ./3rdparty/libprocess/report.xml > ./src/report.xml > make[1]: *** [distcleancheck] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4317) Document use of mesos specific future design patterns in gmock test framework
[ https://issues.apache.org/jira/browse/MESOS-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-4317: - Description: Mesos relies heavily on google test and google mock frameworks for its unit test infrastructure. In order to support unit testing of mesos classes that are inherently designed to be multi-threaded (or multi-process), and asynchronous in nature, the libprocess future/promise design patterns have been used to expose a set of APIs that allow for asynchronous callbacks within the mesos-specific gmock test framework (3rdparty/libprocess/include/process/gmock.hpp). Given that this future/promise-based API is very specific to the Apache Mesos test framework, it would be good to have documentation about its use cases to better inform developers (especially newbies) of this infrastructure. was: Mesos relies heavily on google test and google mock frameworks for its unit test infrastructure. In order to support unit testing of mesos classes that are inherently designed to be multi-threaded (or multi-process), and asynchronous in nature, the libprocess future/promise design patterns have been used to expose a set of APIs that allow for asynchronous callbacks within the mesos-specific gmock test framework (3rdparty/libprocess/include/process/gmock.hpp). Given that this future/promise-based API is very specific to the Apache Mesos test framework, it would be good to have documentation to better inform developers (especially newbies) of the infrastructure and its use cases. > Document use of mesos specific future design patterns in gmock test framework > - > > Key: MESOS-4317 > URL: https://issues.apache.org/jira/browse/MESOS-4317 > Project: Mesos > Issue Type: Documentation >Reporter: Avinash Sridharan >Priority: Minor > > Mesos relies heavily on google test and google mock frameworks for its unit > test infrastructure. In order to support unit testing of mesos classes that > are inherently designed to be multi-threaded (or multi-process), and > asynchronous in nature, the libprocess future/promise design patterns have > been used to expose a set of APIs that allow for asynchronous callbacks within > the mesos-specific gmock test framework > (3rdparty/libprocess/include/process/gmock.hpp). > Given that this future/promise-based API is very specific to the Apache > Mesos test framework, it would be good to have documentation about its > use cases to better inform developers (especially newbies) of this > infrastructure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4317) Document use of mesos specific future design patterns in gmock test framework
Avinash Sridharan created MESOS-4317: Summary: Document use of mesos specific future design patterns in gmock test framework Key: MESOS-4317 URL: https://issues.apache.org/jira/browse/MESOS-4317 Project: Mesos Issue Type: Documentation Reporter: Avinash Sridharan Priority: Minor Mesos relies heavily on the google test and google mock frameworks for its unit test infrastructure. In order to support unit testing of Mesos classes that are inherently designed to be multi-threaded (or multi-process) and asynchronous in nature, the libprocess future/promise design patterns have been used to expose a set of APIs that allow for asynchronous callbacks within the Mesos-specific gmock test framework (3rdparty/libprocess/include/process/gmock.hpp). Given that these future/promise-based APIs are very specific to the Apache Mesos test framework, it would be good to have documentation to better inform developers (especially newbies) of the infrastructure and its use-cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
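For reference, the core pattern such documentation would need to cover looks roughly like this (a sketch; {{SomeProcess::method}} stands in for whatever dispatch a test wants to observe, and the helpers come from process/gmock.hpp and process/gtest.hpp):

{code}
// Intercept a future dispatch to SomeProcess::method on any process
// instance ('_' is the gmock wildcard matcher).
Future<Nothing> dispatched = FUTURE_DISPATCH(_, &SomeProcess::method);

// ... drive the code under test so that the dispatch eventually happens ...

// Block (with a default timeout) until the intercepted dispatch occurs.
AWAIT_READY(dispatched);
{code}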
[jira] [Updated] (MESOS-3746) Consider introducing a mechanism to provide feedback on offer operations
[ https://issues.apache.org/jira/browse/MESOS-3746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3746: --- Assignee: (was: Neil Conway) > Consider introducing a mechanism to provide feedback on offer operations > > > Key: MESOS-3746 > URL: https://issues.apache.org/jira/browse/MESOS-3746 > Project: Mesos > Issue Type: Task > Components: master >Reporter: Michael Park > Labels: mesosphere, persistent-volumes, reservations > > Currently, the master does not provide a direct feedback to the framework > when an operation is dropped: > https://github.com/apache/mesos/blob/master/src/master/master.cpp#L1713-L1715 > A "subsequent offer" is used as the mechanism to determine whether an > operation succeeded or not, which is not sufficient if a framework mistakenly > sends invalid operations. There should be an immediate feedback as to whether > the request was "accepted". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3790) Zk connection should retry on EAI_NONAME
[ https://issues.apache.org/jira/browse/MESOS-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-3790: --- Assignee: (was: Neil Conway) > Zk connection should retry on EAI_NONAME > > > Key: MESOS-3790 > URL: https://issues.apache.org/jira/browse/MESOS-3790 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway >Priority: Minor > Labels: mesosphere, zookeeper > > The zookeeper interface is designed to retry (once per second for up to ten > minutes) if one or more of the Zookeeper hostnames can't be resolved (see > [MESOS-1326] and [MESOS-1523]). > However, the current implementation assumes that a DNS resolution failure is > indicated by zookeeper_init() returning NULL and errno being set to EINVAL > (Zk translates getaddrinfo() failures into errno values). However, the > current Zk code does: > {code} > static int getaddrinfo_errno(int rc) { > switch(rc) { > case EAI_NONAME: > // ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD. > #if defined EAI_NODATA && EAI_NODATA != EAI_NONAME > case EAI_NODATA: > #endif > return ENOENT; > case EAI_MEMORY: > return ENOMEM; > default: > return EINVAL; > } > } > {code} > getaddrinfo() returns EAI_NONAME when "the node or service is not known"; per > discussion in [MESOS-2186], this seems to happen intermittently due to DNS > failures. > Proposed fix: looking at errno is always going to be somewhat fragile, but if > we're going to continue doing that, we should check for ENOENT as well as > EINVAL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
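A sketch of the proposed fix (the {{retry()}} helper and the surrounding loop are illustrative, not the actual Mesos code; the point is only the extra errno check):

{code}
// getaddrinfo_errno() maps EAI_NONAME to ENOENT, so a DNS resolution
// failure can surface as either EINVAL or ENOENT; retry in both cases.
zhandle_t* zh = zookeeper_init(servers, watcher, timeout, NULL, NULL, 0);
if (zh == NULL && (errno == EINVAL || errno == ENOENT)) {
  retry();  // Hypothetical: re-attempt once per second, up to ten minutes.
}
{code}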
[jira] [Comment Edited] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.
[ https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443 ] Jan Schlicht edited comment on MESOS-3082 at 1/8/16 4:26 PM: - Tests trying to sample using perf with the 'cycles' value can cause failures of other tests if run in a virtual machine that does not support _CPU performance counters_. E.g. running {{sudo ./bin/mesos-tests.sh --gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child process running. This process will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run (mostly {{ROOT_CGROUPS_*}}). I'd suggest to disable these tests if in a virtual machine without _CPU performance counters_. was (Author: nfnt): Tests trying to sample using perf with the 'cycles' value can cause failures of other tests if run in a virtual machine that does not support _CPU performance counters_. E.g. running {{sudo ./bin/mesos-tests.sh --gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child process running. This process will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run (mostly {{ROOT_CGROUPS_*}}). I'd suggest to disable these test if in a virtual machine without _CPU performance counters_. > Perf related tests rely on 'cycles' which might not always be present. > -- > > Key: MESOS-3082 > URL: https://issues.apache.org/jira/browse/MESOS-3082 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 14.04 (in a virtual machine) >Reporter: Benjamin Hindman >Assignee: Jan Schlicht > Labels: mesosphere > > When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf > is always 0, meaning certain tests always fail. These lines in the test have > been commented out for now and a TODO has been attached which links to this > JIRA issue, since the solution is unclear. In particular, 'cycles' might not > properly be counted because it is a hardware counter and this particular > machine was a virtual machine. Either way, we should determine the best > events to collect from perf in either VM or physical settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.
[ https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-3082: Sprint: Mesosphere Sprint 26 > Perf related tests rely on 'cycles' which might not always be present. > -- > > Key: MESOS-3082 > URL: https://issues.apache.org/jira/browse/MESOS-3082 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 14.04 (in a virtual machine) >Reporter: Benjamin Hindman >Assignee: Jan Schlicht > Labels: mesosphere > > When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf > is always 0, meaning certain tests always fail. These lines in the test have > been commented out for now and a TODO has been attached which links to this > JIRA issue, since the solution is unclear. In particular, 'cycles' might not > properly be counted because it is a hardware counter and this particular > machine was a virtual machine. Either way, we should determine the best > events to collect from perf in either VM or physical settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.
[ https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443 ] Jan Schlicht edited comment on MESOS-3082 at 1/8/16 4:22 PM: - Tests trying to sample using perf with the 'cycles' value can cause failures of other tests if run in a virtual machine that does not support _CPU performance counters_. E.g. running {{sudo ./bin/mesos-tests.sh --gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child process running. This process will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run (mostly {{ROOT_CGROUPS_*}}). I'd suggest to disable these test if in a virtual machine without _CPU performance counters_. was (Author: nfnt): Tests trying to sample using perf with the 'cycles' value can cause failures of other tests if run on a virtual machine that does not support _CPU performance counters_. E.g. running {{sudo ./bin/mesos-tests.sh --gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child process running. This process will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run (mostly {{ROOT_CGROUPS_*}}). I'd suggest to disable these test if in a virtual machine without _CPU performance counters_. > Perf related tests rely on 'cycles' which might not always be present. > -- > > Key: MESOS-3082 > URL: https://issues.apache.org/jira/browse/MESOS-3082 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 14.04 (in a virtual machine) >Reporter: Benjamin Hindman >Assignee: Jan Schlicht > Labels: mesosphere > > When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf > is always 0, meaning certain tests always fail. These lines in the test have > been commented out for now and a TODO has been attached which links to this > JIRA issue, since the solution is unclear. In particular, 'cycles' might not > properly be counted because it is a hardware counter and this particular > machine was a virtual machine. Either way, we should determine the best > events to collect from perf in either VM or physical settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.
[ https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443 ] Jan Schlicht edited comment on MESOS-3082 at 1/8/16 4:20 PM: - Tests trying to sample using perf with the 'cycles' value can cause failures of other tests if run on a virtual machine that does not support _CPU performance counters_. E.g. running {{sudo ./bin/mesos-tests.sh --gtest_filter="\*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child process running. This process will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run (mostly {{ROOT_CGROUPS_*}}). I'd suggest to disable these test if in a virtual machine without _CPU performance counters_. was (Author: nfnt): Tests trying to sample using perf with the 'cycles' value can cause failures of other tests if run on a virtual machine that does not support _CPU performance counters_. E.g. running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child process running. This process will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run (mostly {{ROOT_CGROUPS_*}}). I'd suggest to disable these test if in a virtual machine without _CPU performance counters_. > Perf related tests rely on 'cycles' which might not always be present. > -- > > Key: MESOS-3082 > URL: https://issues.apache.org/jira/browse/MESOS-3082 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 14.04 (in a virtual machine) >Reporter: Benjamin Hindman > Labels: mesosphere > > When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf > is always 0, meaning certain tests always fail. These lines in the test have > been commented out for now and a TODO has been attached which links to this > JIRA issue, since the solution is unclear. In particular, 'cycles' might not > properly be counted because it is a hardware counter and this particular > machine was a virtual machine. Either way, we should determine the best > events to collect from perf in either VM or physical settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.
[ https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht reassigned MESOS-3082: --- Assignee: Jan Schlicht > Perf related tests rely on 'cycles' which might not always be present. > -- > > Key: MESOS-3082 > URL: https://issues.apache.org/jira/browse/MESOS-3082 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 14.04 (in a virtual machine) >Reporter: Benjamin Hindman >Assignee: Jan Schlicht > Labels: mesosphere > > When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf > is always 0, meaning certain tests always fail. These lines in the test have > been commented out for now and a TODO has been attached which links to this > JIRA issue, since the solution is unclear. In particular, 'cycles' might not > properly be counted because it is a hardware counter and this particular > machine was a virtual machine. Either way, we should determine the best > events to collect from perf in either VM or physical settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3082) Perf related tests rely on 'cycles' which might not always be present.
[ https://issues.apache.org/jira/browse/MESOS-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089443#comment-15089443 ] Jan Schlicht commented on MESOS-3082: - Tests trying to sample using perf with the 'cycles' value can cause failures of other tests if run on a virtual machine that does not support _CPU performance counters_. E.g. running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} will fail and sometimes keep a child process running. This process will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run (mostly {{ROOT_CGROUPS_*}}). I'd suggest to disable these test if in a virtual machine without _CPU performance counters_. > Perf related tests rely on 'cycles' which might not always be present. > -- > > Key: MESOS-3082 > URL: https://issues.apache.org/jira/browse/MESOS-3082 > Project: Mesos > Issue Type: Bug > Environment: Ubuntu 14.04 (in a virtual machine) >Reporter: Benjamin Hindman > Labels: mesosphere > > When running the tests on Ubuntu 14.04 the 'cycles' value collected by perf > is always 0, meaning certain tests always fail. These lines in the test have > been commented out for now and a TODO has been attached which links to this > JIRA issue, since the solution is unclear. In particular, 'cycles' might not > properly be counted because it is a hardware counter and this particular > machine was a virtual machine. Either way, we should determine the best > events to collect from perf in either VM or physical settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
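A quick manual check for whether a (virtual) machine exposes the hardware counter at all (a sketch; the exact output format varies by perf version):

{code}
perf stat -e cycles -- sleep 1
# On a VM without CPU performance counters the summary line reads
# "<not supported>      cycles" - the condition under which these
# tests would need to be disabled.
{code}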
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089364#comment-15089364 ] Martin Bydzovsky commented on MESOS-4279: - Well I guess you introduced another "issue" in your test example. It's related to the way you started the Marathon app. Please look at the explanation here: https://mesosphere.github.io/marathon/docs/native-docker.html#command-vs-args. In your {{ps}} output, you can see that the actual command is {{/bin/sh -c python /app/script.py}} - wrapped by sh -c. Seems like you started your Marathon app with something like: {code}curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", cmd: "python script.py", ...} {code} What I was showing in my examples above was: {code}curl -XPOST http://marathon:8080/v2/apps --data={id: "test-app", args: ["/tmp/script.py"], ...} {code} Usually this is called the "PID 1 problem" - https://medium.com/@gchudnov/trapping-signals-in-docker-containers-7a57fdda7d86#.zcxhq8yqn. Simply said, in your example the PID 1 inside the docker container is the shell process and the actual python script is PID 2. The default signal handlers for all processes EXCEPT PID 1 are to shut down on SIGINT/SIGTERM; PID 1's default signal handlers just ignore them. So you could retry the example and use args instead of cmd. Then your {{ps}} output should look like: {code} root 10738 0.0 0.0 218228 14236 ? 15:22 0:00 docker run -c 102 -m 268435456 -e PORT_10002=31123 -e MARATHON_APP_VERSION=2016-01-08T15:22:49.646Z -e HOST=mesos-slave1.example.com -e MARATHON_APP_DOCKER_IMAGE=bydga/marathon-test-api -e MESOS_TASK_ID=marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b -e PORT=31123 -e PORTS=31123 -e MARATHON_APP_ID=/marathon-test-api -e PORT0=31123 -e MESOS_SANDBOX=/mnt/mesos/sandbox -v /srv/mesos/slaves/20160106-114735-3423223818-5050-1508-S3/frameworks/20160106-083626-1258962954-5050-9311-/executors/marathon-test-api.ad9cbac5-b61b-11e5-af54-023bd987a59b/runs/bbeb80ab-e8d0-4b93-b7a0-6475787e090f:/mnt/mesos/sandbox --net host --name mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f bydga/marathon-test-api ./script.py root 10749 0.0 0.0 21576 4336 ? 15:22 0:00 /usr/bin/python ./script.py {code} With this setup, docker stop works as expected: {code} bydzovskym mesos-slave1:aws ~ 🍺 docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES ed4a35e4372c bydga/marathon-test-api "./script.py" 7 minutes ago Up 7 minutes mesos-20160106-114735-3423223818-5050-1508-S3.bbeb80ab-e8d0-4b93-b7a0-6475787e090f bydzovskym mesos-slave1:aws ~ 🍺 time docker stop ed4a35e4372c ed4a35e4372c real 0m2.184s user 0m0.016s sys 0m0.042s {code} and the output of the docker container: {code} bydzovskym mesos-slave1:aws ~ 🍺 docker logs -f ed4a35e4372c Hello 15:15:57.943294 Iteration #1 15:15:58.944470 Iteration #2 15:15:59.945631 Iteration #3 15:16:00.946794 got 15 15:16:40.473517 15:16:42.475655 ending Goodbye {code} The docker stop took a little more than 2 seconds - matching the grace period in the python script. I still guess the problem is somewhere in the way Mesos orchestrates Docker - either it sends a wrong {{docker kill}} or it kills the task even more painfully (killing the docker run process with the Linux {{kill}} command)...
> Graceful restart of docker task > --- > > Key: MESOS-4279 > URL: https://issues.apache.org/jira/browse/MESOS-4279 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 >Reporter: Martin Bydzovsky >Assignee: Qian Zhang > > I'm implementing a graceful restarts of our mesos-marathon-docker setup and I > came to a following issue: > (it was already discussed on > https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere > got to a point that its probably a docker containerizer problem...) > To sum it up: > When i deploy simple python script to all mesos-slaves: > {code} > #!/usr/bin/python > from time import sleep > import signal > import sys > import datetime > def sigterm_handler(_signo, _stack_frame): > print "got %i" % _signo > print datetime.datetime.now().time() > sys.stdout.flush() > sleep(2) > print datetime.datetime.now().time() > print "ending" > sys.stdout.flush() > sys.exit(0) > signal.signal(signal.SIGTERM, sigterm_handler) > signal.signal(signal.SIGINT, sigterm_handler) > try: > print "Hello" > i = 0 > while True: > i += 1 > print datetime.datetime.now().time() > print "Iteration #%i" % i > sys.stdo
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089301#comment-15089301 ] Qian Zhang commented on MESOS-4279: --- When creating an app of Docker type in Marathon, the processes launched on the Mesos agent look like this: {code} root 2086 2063 0 Jan06 ? 00:00:49 docker -H unix:///var/run/docker.sock run -c 102 -m 33554432 -e MARATHON_APP_VERSION=2016-01-06T14:24:40.412Z -e HOST=mesos -e MARATHON_APP_DOCKER_IMAGE=mesos-4279 -e PORT_1=31433 -e MESOS_TASK_ID=app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff -e PORT=31433 -e PORTS=31433 -e MARATHON_APP_ID=/app-docker1 -e PORT0=31433 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471 -v /tmp/mesos/slaves/9ee670be-3c38-4c23-91c1-826b283dd283-S7/frameworks/83ced7f5-69b3-409b-abe5-a582a5d278cd-/executors/app-docker1.af64d5d2-b481-11e5-bdf1-0242497320ff/runs/a919ce36-9b6e-4086-bfe8-9f0a34a3f471:/mnt/mesos/sandbox --net bridge --entrypoint /bin/sh --name mesos-9ee670be-3c38-4c23-91c1-826b283dd283-S7.a919ce36-9b6e-4086-bfe8-9f0a34a3f471 mesos-4279 -c python /app/script.py root 2124 2103 0 Jan06 ? 00:00:00 /bin/sh -c python /app/script.py root 2140 2124 0 Jan06 ? 00:00:35 python /app/script.py {code} The first process (2086) is the "docker run" command launched by the Mesos docker executor, and the second and third processes (2124 and 2140) are the app processes launched by the Docker daemon. When restarting the app in Marathon, the Mesos docker executor kills the app processes first; the way it does the "kill" is to run the "docker stop" command (https://github.com/apache/mesos/blob/0.26.0/src/docker/executor.cpp#L218), and "docker stop" will ONLY send SIGTERM to process 2124, but NOT to 2140 (the actual user script); that's why the signal handler in the user script is not triggered. However, for an app that is not of Docker type, the executor sends SIGTERM to the whole process group when killing it (https://github.com/apache/mesos/blob/0.26.0/src/launcher/executor.cpp#L419), so the user script gets the signal too. I am not sure if there is a way for "docker stop" to not only send SIGTERM to the parent of the user script process but also to the user script process itself ... > Graceful restart of docker task > --- > > Key: MESOS-4279 > URL: https://issues.apache.org/jira/browse/MESOS-4279 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 >Reporter: Martin Bydzovsky >Assignee: Qian Zhang > > I'm implementing a graceful restarts of our mesos-marathon-docker setup and I > came to a following issue: > (it was already discussed on > https://github.com/mesosphere/marathon/issues/2876 and guys form mesosphere > got to a point that its probably a docker containerizer problem...) 
> To sum it up: > When i deploy simple python script to all mesos-slaves: > {code} > #!/usr/bin/python > from time import sleep > import signal > import sys > import datetime > def sigterm_handler(_signo, _stack_frame): > print "got %i" % _signo > print datetime.datetime.now().time() > sys.stdout.flush() > sleep(2) > print datetime.datetime.now().time() > print "ending" > sys.stdout.flush() > sys.exit(0) > signal.signal(signal.SIGTERM, sigterm_handler) > signal.signal(signal.SIGINT, sigterm_handler) > try: > print "Hello" > i = 0 > while True: > i += 1 > print datetime.datetime.now().time() > print "Iteration #%i" % i > sys.stdout.flush() > sleep(1) > finally: > print "Goodbye" > {code} > and I run it through Marathon like > {code:javascript} > data = { > args: ["/tmp/script.py"], > instances: 1, > cpus: 0.1, > mem: 256, > id: "marathon-test-api" > } > {code} > During the app restart I get expected result - the task receives sigterm and > dies peacefully (during my script-specified 2 seconds period) > But when i wrap this python script in a docker: > {code} > FROM node:4.2 > RUN mkdir /app > ADD . /app > WORKDIR /app > ENTRYPOINT [] > {code} > and run appropriate application by Marathon: > {code:javascript} > data = { > args: ["./script.py"], > container: { > type: "DOCKER", > docker: { > image: "bydga/marathon-test-api" > }, > forcePullImage: yes > }, > cpus: 0.1, > mem: 256, > instances: 1, > id: "marathon-test-api" > } > {code} > The task during restart (issued from marathon) dies immediately without > having a chance to do any cleanup. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
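One known workaround at the app level (a sketch, not a containerizer fix): have the wrapping shell {{exec}} the script, so the script replaces the shell as the container's main process and receives the SIGTERM from "docker stop" directly:

{code}
# Shell form: /bin/sh stays PID 1 and ignores SIGTERM; the script
# (PID 2) never sees the signal.
/bin/sh -c "python /app/script.py"

# With exec: the shell replaces itself with the script, so the script
# becomes PID 1 and gets the SIGTERM from "docker stop".
/bin/sh -c "exec python /app/script.py"
{code}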
[jira] [Created] (MESOS-4316) Support get non-default weights by /weights
Yongqiao Wang created MESOS-4316: Summary: Support get non-default weights by /weights Key: MESOS-4316 URL: https://issues.apache.org/jira/browse/MESOS-4316 Project: Mesos Issue Type: Task Reporter: Yongqiao Wang Assignee: Yongqiao Wang Priority: Minor Like /quota, we should also add query logic for /weights to keep things consistent. Then /roles would no longer need to show weight information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
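The query could plausibly mirror /quota (a hypothetical sketch; the exact response format is what this ticket would define):

{code}
# Hypothetical: GET returns only the non-default weights, e.g.
# [{"role": "ads", "weight": 2.0}]
curl http://<master-ip>:5050/weights
{code}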
[jira] [Commented] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089264#comment-15089264 ] Jan Schlicht commented on MESOS-4035: - This doesn't seem to be related to the perf support. I could reproduce this on a virtual machine where {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_UserCgroup"}} was the first command being run. There are some issues where perf related tests could fail and leave a running process that could influence subsequent test runs, but this problem seems to be different. > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song >Assignee: Jan Schlicht > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/2.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::
[jira] [Issue Comment Deleted] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-4035: Comment: was deleted (was: I assume that this was in a virtual machine and that something like {{sudo ./bin/mesos-tests.sh}} was running prior to this? If tried reproducing this and am pretty sure, that I've seen the exact same error before, but could only find something that is quite similar and probably having the same cause: Some virtual machines (e.g. Virtualbox) don't provide _CPU performance counters_ for their guests. This affects some root tests of Mesos that try to use {{perf}} to sample the {{cycles}} event. One of these tests is {{PerfEventIsolatorTest.ROOT_CGROUPS_Sample}}. Running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} in such an environment will fail and keep a child process running that will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run. {{UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup}} is one of those. Restarting the VM will reset this behavior. So, in a fresh VM, running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_UserCgroup"}} should pass, but doing this after running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} should fail.) > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song >Assignee: Jan Schlicht > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Fai
[jira] [Commented] (MESOS-4035) UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6
[ https://issues.apache.org/jira/browse/MESOS-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089240#comment-15089240 ] Jan Schlicht commented on MESOS-4035: - I assume that this was in a virtual machine and that something like {{sudo ./bin/mesos-tests.sh}} was running prior to this? I tried reproducing this and am pretty sure that I've seen the exact same error before, but could only find something that is quite similar and probably has the same cause: Some virtual machines (e.g. Virtualbox) don't provide _CPU performance counters_ for their guests. This affects some root tests of Mesos that try to use {{perf}} to sample the {{cycles}} event. One of these tests is {{PerfEventIsolatorTest.ROOT_CGROUPS_Sample}}. Running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} in such an environment will fail and keep a child process running that will block some cgroups from being removed. This affects all test processes that are run afterwards that try to clean up some cgroups before being run. {{UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup}} is one of those. Restarting the VM will reset this behavior. So, in a fresh VM, running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_UserCgroup"}} should pass, but doing this after running {{sudo ./bin/mesos-tests.sh --gtest_filter="*ROOT_CGROUPS_Sample"}} should fail. > UserCgroupIsolatorTest.ROOT_CGROUPS_UserCgroup fails on CentOS 6.6 > -- > > Key: MESOS-4035 > URL: https://issues.apache.org/jira/browse/MESOS-4035 > Project: Mesos > Issue Type: Bug > Environment: CentOS6.6 >Reporter: Gilbert Song >Assignee: Jan Schlicht > > `ROOT_CGROUPS_UserCgroup` on CentOS6.6 with 0.26rc3. The environment setup on > CentOS6.6 is based on latest update of /docs/getting-started.md. Either using > devtoolset-2 or devtoolset-3 returns the same failure. > If running `sudo ./bin/mesos-tests.sh > --gtest_filter="*ROOT_CGROUPS_UserCgroup*"`, it would return failures as > following log: > {noformat} > [==] Running 3 tests from 3 test cases. > [--] Global test environment set-up. > [--] 1 test from UserCgroupIsolatorTest/0, where TypeParam = > mesos::internal::slave::CgroupsMemIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/0.*). 
> - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/0.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsMemIsolatorProcess (1 ms) > [--] 1 test from UserCgroupIsolatorTest/0 (1 ms total) > [--] 1 test from UserCgroupIsolatorTest/1, where TypeParam = > mesos::internal::slave::CgroupsCpushareIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup > ../../src/tests/mesos.cpp:722: Failure > cgroups::mount(hierarchy, subsystem): '/tmp/mesos_test_cgroup/perf_event' > already exists in the file system > - > We cannot run any cgroups tests that require > a hierarchy with subsystem 'perf_event' > because we failed to find an existing hierarchy > or create a new one (tried '/tmp/mesos_test_cgroup/perf_event'). > You can either remove all existing > hierarchies, or disable this test case > (i.e., --gtest_filter=-UserCgroupIsolatorTest/1.*). > - > ../../src/tests/mesos.cpp:776: Failure > cgroups: '/tmp/mesos_test_cgroup/perf_event' is not a valid hierarchy > [ FAILED ] UserCgroupIsolatorTest/1.ROOT_CGROUPS_UserCgroup, where > TypeParam = mesos::internal::slave::CgroupsCpushareIsolatorProcess (4 ms) > [--] 1 test from UserCgroupIsolatorTest/1 (5 ms total) > [--] 1 test from UserCgroupIsolatorTest/2, where TypeParam = > mesos::internal::slave::CgroupsPerfEventIsolatorProcess > userdel: user 'mesos.test.unprivileged.user' does not exist > [ RUN ] UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup > ../../src
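Until this is fixed, manually cleaning up the leftover hierarchy (a sketch, using the path from the error message above) should unblock subsequent test runs:

{code}
# If the leftover hierarchy is still mounted, unmount it first, then
# remove the directory the tests complain about.
sudo umount /tmp/mesos_test_cgroup/perf_event
sudo rmdir /tmp/mesos_test_cgroup/perf_event
{code}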
[jira] [Created] (MESOS-4315) Improve Quota Failover Logic
Joerg Schad created MESOS-4315: -- Summary: Improve Quota Failover Logic Key: MESOS-4315 URL: https://issues.apache.org/jira/browse/MESOS-4315 Project: Mesos Issue Type: Improvement Reporter: Joerg Schad The Quota failover logic introduced with MESOS-3865 changes master failover recovery significantly if at least one quota is set. Now, if any previously set quota is detected upon recovery, the allocator enters recovery mode, during which it does not issue offers. The recovery mode (and therefore offer suspension) ends when either: * a certain number of agents reregisters (by default 80% of the agents known before the failover), or * a timeout expires (by default 10 minutes). We could also safely exit recovery mode once all quota has been satisfied (i.e. all agents participating in satisfying quota have reconnected). For small clusters with a large percentage of quota'ed resources this will not make much difference compared to the existing rules, but for larger clusters this condition could be fulfilled much faster than the 80% condition. We should at least consider whether such a condition is worth the added complexity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
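A sketch of the proposed exit condition (names and defaults are illustrative, not the actual allocator code):

{code}
// Recovery mode could end as soon as any of the three conditions holds.
bool canExitRecovery(size_t agentsReregistered, size_t agentsKnownBefore,
                     const Duration& elapsed, bool quotaSatisfied)
{
  return agentsReregistered >= 0.80 * agentsKnownBefore  // default 80%
      || elapsed >= Minutes(10)                          // default timeout
      || quotaSatisfied;  // proposed: all quota'ed agents reconnected
}
{code}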
[jira] [Commented] (MESOS-3877) Draft operator documentation for quota
[ https://issues.apache.org/jira/browse/MESOS-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089179#comment-15089179 ] Joerg Schad commented on MESOS-3877: Finished the draft and published the review via MESOS-4314. > Draft operator documentation for quota > -- > > Key: MESOS-3877 > URL: https://issues.apache.org/jira/browse/MESOS-3877 > Project: Mesos > Issue Type: Task > Components: documentation >Reporter: Alexander Rukletsov >Assignee: Joerg Schad > Labels: mesosphere > > Draft an operator guide for quota which describes basic usage of the > endpoints and a few basic and advanced usage cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4314) Publish Quota Documentation
[ https://issues.apache.org/jira/browse/MESOS-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-4314: --- Sprint: Mesosphere Sprint 26 > Publish Quota Documentation > --- > > Key: MESOS-4314 > URL: https://issues.apache.org/jira/browse/MESOS-4314 > Project: Mesos > Issue Type: Documentation >Reporter: Joerg Schad >Assignee: Joerg Schad > > Publish and finish the operator guide draft for quota which describes basic > usage of the endpoints and a few basic and advanced usage cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4314) Publish Quota Documentation
Joerg Schad created MESOS-4314: -- Summary: Publish Quota Documentation Key: MESOS-4314 URL: https://issues.apache.org/jira/browse/MESOS-4314 Project: Mesos Issue Type: Documentation Reporter: Joerg Schad Assignee: Joerg Schad Publish and finish the operator guide draft for quota which describes basic usage of the endpoints and a few basic and advanced usage cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3877) Draft operator documentation for quota
[ https://issues.apache.org/jira/browse/MESOS-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joerg Schad updated MESOS-3877: --- Description: Draft an operator guide for quota which describes basic usage of the endpoints and a few basic and advanced usage cases. (was: Add an operator guide for quota which describes basic usage of the endpoints and a few basic and advanced usage cases.) > Draft operator documentation for quota > -- > > Key: MESOS-3877 > URL: https://issues.apache.org/jira/browse/MESOS-3877 > Project: Mesos > Issue Type: Task > Components: documentation >Reporter: Alexander Rukletsov >Assignee: Joerg Schad > Labels: mesosphere > > Draft an operator guide for quota which describes basic usage of the > endpoints and a few basic and advanced usage cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3307) Configurable size of completed task / framework history
[ https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089129#comment-15089129 ] Ian Babrou commented on MESOS-3307: --- Having API params to fetch only interesting tasks would be very nice. Mesos DNS and similar tools don't care about the size of the completed task history; they only care about alive tasks. Many tools also only care about tasks with certain labels and/or ports allocated. Having a Mesos event bus similar to Marathon's event bus would eliminate the need to do active polling altogether, but that takes time (is there an issue for this, btw?). I'm okay with having flags for history size, though, since [that's what I use now|https://github.com/cloudflare/mesos/commit/d247372226d6cbbe57fa856a0b3788e60200ef92]. > Configurable size of completed task / framework history > --- > > Key: MESOS-3307 > URL: https://issues.apache.org/jira/browse/MESOS-3307 > Project: Mesos > Issue Type: Bug >Reporter: Ian Babrou >Assignee: Kevin Klues > Labels: mesosphere > > We try to make Mesos work with multiple frameworks and mesos-dns at the same > time. The goal is to have a set of frameworks per team / project on a single > Mesos cluster. > At this point our mesos state.json is at 4mb and it takes a while to > assemble. 5 mesos-dns instances hit state.json every 5 seconds, effectively > pushing mesos-master CPU usage through the roof. It's at 100%+ all the time. > Here's the problem: > {noformat} > mesos λ curl -s http://mesos-master:5050/master/state.json | jq > .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n >1 "20150606-001827-252388362-5050-5982-0003" > 16 "20150606-001827-252388362-5050-5982-0005" > 18 "20150606-001827-252388362-5050-5982-0029" > 73 "20150606-001827-252388362-5050-5982-0007" > 141 "20150606-001827-252388362-5050-5982-0009" > 154 "20150820-154817-302720010-5050-15320-" > 289 "20150606-001827-252388362-5050-5982-0004" > 510 "20150606-001827-252388362-5050-5982-0012" > 666 "20150606-001827-252388362-5050-5982-0028" > 923 "20150116-002612-269165578-5050-32204-0003" > 1000 "20150606-001827-252388362-5050-5982-0001" > 1000 "20150606-001827-252388362-5050-5982-0006" > 1000 "20150606-001827-252388362-5050-5982-0010" > 1000 "20150606-001827-252388362-5050-5982-0011" > 1000 "20150606-001827-252388362-5050-5982-0027" > mesos λ fgrep 1000 -r src/master > src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 10; > src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = > 1000; > {noformat} > Active tasks are just 6% of state.json response: > {noformat} > mesos λ cat ~/temp/mesos-state.json | jq -c . | wc >1 14796 4138942 > mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc > 16 37 252774 > {noformat} > I see four options that can improve the situation: > 1. Add query string param to exclude completed tasks from state.json and use > it in mesos-dns and similar tools. There is no need for mesos-dns to know > about completed tasks, it's just extra load on master and mesos-dns. > 2. Make history size configurable. > 3. Make JSON serialization faster. With 1s of tasks even without history > it would take a lot of time to serialize tasks for mesos-dns. Doing it every > 60 seconds instead of every 5 seconds isn't really an option. > 4. Create event bus for mesos master. Marathon has it and it'd be nice to > have it in Mesos. This way mesos-dns could avoid polling master state and > switch to listening for events. > All can be done independently. 
> Note to mesosphere folks: please start distributing debug symbols with your > distribution. I have been asking for this for a while and it is really helpful: > https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501 > Perf report for leading master: > !http://i.imgur.com/iz7C3o0.png! > I'm on 0.23.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
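Option 1 could look something like this (a hypothetical parameter name; nothing like it exists yet):

{code}
# Fetch state without the completed-task history that mesos-dns ignores:
curl 'http://mesos-master:5050/master/state.json?completed_tasks=false'
{code}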
[jira] [Commented] (MESOS-4312) Porting Mesos on Power (ppc64le)
[ https://issues.apache.org/jira/browse/MESOS-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089066#comment-15089066 ] Qian Zhang commented on MESOS-4312: --- RR: https://reviews.apache.org/r/42068/ https://reviews.apache.org/r/42069/ > Porting Mesos on Power (ppc64le) > > > Key: MESOS-4312 > URL: https://issues.apache.org/jira/browse/MESOS-4312 > Project: Mesos > Issue Type: Improvement >Reporter: Qian Zhang >Assignee: Qian Zhang > > The goal of this ticket is to make IBM Power (ppc64le) a supported > hardware platform of Mesos. Currently the latest Mesos code cannot be > successfully built on ppc64le; we will resolve the build errors in this > ticket, and also make sure the Mesos test suite ("make check") can be run > successfully on ppc64le. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4313) S
Joerg Schad created MESOS-4313: -- Summary: S Key: MESOS-4313 URL: https://issues.apache.org/jira/browse/MESOS-4313 Project: Mesos Issue Type: Bug Reporter: Joerg Schad -- This message was sent by Atlassian JIRA (v6.3.4#6332)