Build failed in Jenkins: mesos-reviewbot #1507
See https://builds.apache.org/job/mesos-reviewbot/1507/

[...truncated 5668 lines...]

make[1]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build'
== mesos-0.21.0 archives ready for distribution: mesos-0.21.0.tar.gz ==
real 88m4.316s
user 143m6.603s
sys 7m53.711s
+ chmod -R +w [checkout tree]
+ git clean -fdx
[long list of generated build files removed by `git clean -fdx` (Makefiles, .deps/, libtool/autoconf outputs, deploy and gdb/lldb/valgrind wrapper scripts, mesos-0.21.0.tar.gz, example binaries); truncated]
Build failed in Jenkins: mesos-reviewbot #1508
See https://builds.apache.org/job/mesos-reviewbot/1508/changes

Changes:
[niklas] Fixed line comments end punctuation in Mesos source.
[niklas] Fixed line comments end punctuation in stout.
[niklas] Fixed line comments end punctuation in libprocess.
[dlester] Adds Qubit to PoweredByMesos list.

[...truncated 5704 lines...]

make[1]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build'
== mesos-0.21.0 archives ready for distribution: mesos-0.21.0.tar.gz ==
real 111m33.909s
user 141m57.055s
sys 7m58.774s
+ chmod -R +w [checkout tree]
+ git clean -fdx
[distclean and `git clean -fdx` output removing generated build files under src/, ec2/, and 3rdparty/; truncated]
Build failed in Jenkins: mesos-reviewbot #1509
See https://builds.apache.org/job/mesos-reviewbot/1509/

[...truncated 5591 lines...]

make[1]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build'
== mesos-0.21.0 archives ready for distribution: mesos-0.21.0.tar.gz ==
real 76m42.235s
user 142m15.842s
sys 7m52.073s
+ chmod -R +w [checkout tree]
+ git clean -fdx
[distclean and `git clean -fdx` output removing generated build files; truncated]
Dynamic Resource Roles
Hey everyone, Just a quick question: has there ever been any discussion around dynamic roles? What I mean by this: currently, if I want to guarantee 1 core and 10 GB of RAM to a specific type of framework (or role), I need to do this at the slave level. This means that if I only want to guarantee a small amount of resources, I could do this on one slave. If that slave dies, that resource is no longer available. It would be interesting to see the master (DRF scheduler) capable of reserving a minimum amount of resources for offering only to frameworks of a certain role, such that I can guarantee R amount of resources on N slaves across the cluster as a whole. Tom.
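For context, the slave-level guarantee described above is expressed with the slave's --resources flag, which accepts role-qualified amounts. The role name, sizes, master address, and the unreserved remainder below are illustrative, not a recommended configuration:

```shell
# Statically reserve 1 CPU and 10 GB of RAM on this one slave for
# frameworks registered under the (illustrative) role "prod"; the rest
# stays in the default '*' pool. Because the reservation is tied to this
# slave, it disappears if the slave dies -- the limitation raised above.
mesos-slave --master=zk://localhost:2181/mesos \
  --resources='cpus(prod):1;mem(prod):10240;cpus(*):7;mem(*):22528'
```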
Re: Build failed in Jenkins: Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui #2358
Just flying cover, but my change is unrelated to the issue. However, I see this issue quite often as well. Cheers, Tim

----- Original Message -----
From: Yan Xu y...@jxu.me
To: dev@mesos.apache.org
Cc: Vinod Kone vinodk...@gmail.com
Sent: Tuesday, September 9, 2014 4:50:06 PM
Subject: Re: Build failed in Jenkins: Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui #2358

this is https://issues.apache.org/jira/browse/MESOS-1766

-- Jiang Yan Xu y...@jxu.me @xujyan http://twitter.com/xujyan

On Fri, Sep 5, 2014 at 8:53 AM, Apache Jenkins Server jenk...@builds.apache.org wrote:

See https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-Out-Of-Src-Disable-Java-Disable-Python-Disable-Webui/2358/changes

Changes:
[tstclair] Minor update to include package config file

[...truncated 57484 lines...]

[verbose test log truncated: replicated log APPEND/TRUNCATE actions persisted to leveldb, registrar recovery, and a successful CRAM-MD5 SASL authentication handshake for principal 'test-principal']
Re: Dynamic Resource Roles
Hi Tom, Reservations are definitely something we've discussed and they will be addressed in the near future. Tim

On Sep 10, 2014, at 7:49 AM, Tom Arnfeld t...@duedil.com wrote: [snip]
Re: Dynamic Resource Roles
That's very cool, thanks.

On Wed, Sep 10, 2014 at 4:59 PM, Timothy Chen tnac...@gmail.com wrote: [snip]
Re: Review Request 25487: Increased session timeouts for ZooKeeper related tests.
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25487/#review52884

src/tests/zookeeper.cpp (https://reviews.apache.org/r/25487/#comment92073): Seconds(10)?

- Dominic Hamon

On Sept. 9, 2014, 10:57 p.m., Jiang Yan Xu wrote:

(Updated Sept. 9, 2014, 10:57 p.m.)

Review request for mesos and Ben Mahler.
Bugs: MESOS-1676 (https://issues.apache.org/jira/browse/MESOS-1676)
Repository: mesos-git

Description:
- On slower machines the ZooKeeper C client sometimes times out where we aren't expecting it to, because either the test server or the client is too slow to respond. Increasing this value helps mitigate the problem.
- The effect of server->shutdownNetwork() is immediate, so this won't prolong the tests so long as they don't wait for session expiration without clock advances, which I have checked and there is none.

Diffs:
- src/tests/master_contender_detector_tests.cpp 9ac59aa446a132e734238e0e55801117c4ef31b4
- src/tests/zookeeper.cpp e45f956e1486e952a4efeb123e15568518fb53fe

Diff: https://reviews.apache.org/r/25487/diff/

Testing: make check.

Thanks, Jiang Yan Xu
Review Request 25508: Fix git clean -xdf skipping leveldb
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/

Review request for mesos, Jie Yu and Vinod Kone.
Bugs: MESOS-1764 (https://issues.apache.org/jira/browse/MESOS-1764)
Repository: mesos-git

Description: Very minor change to allow `git clean -xdf` to remove the leveldb directory.

Diffs:
- 3rdparty/Makefile.am 7cf0c88

Diff: https://reviews.apache.org/r/25508/diff/

Testing: make check

Thanks, Timothy St. Clair
Re: Review Request 25434: Propagate slave shutdown grace period to Executor and CommandExecutor.
On Sept. 9, 2014, 5:50 p.m., Benjamin Hindman wrote:

src/slave/constants.hpp, line 53 (https://reviews.apache.org/r/25434/diff/2/?file=683947#file683947line53): What is the 'base executor' versus the 'command executor'?

Alexander Rukletsov wrote: We have Executor (lives in src/exec/exec.cpp) and CommandExecutor aka mesos-executor (lives in src/launcher/executor.cpp). I find "executor" too vague and use "base executor" to stress that I mean the one that lives in exec.cpp. Is there a convention about naming these folks?

Benjamin Hindman wrote: Ah, I see. Well, CommandExecutor is just an instance of an executor and actually uses the code from exec.cpp just like all current executors do (that use libmesos). So there aren't actually two executors (base and command), just one, and they all use exec.cpp (if they use libmesos). Does that make sense?

Alexander Rukletsov wrote: Sorry, I was inexact in my comment. Indeed, there is only one executor, but two libprocess processes (where almost all of the work is done). Here is what we have:

Executor - ExecutorProcess (I call them both "base executor", though "base executor process" is more correct)
CommandExecutor - CommandExecutorProcess

The OS process where CommandExecutor lives also instantiates the driver; together it looks like this:

    MesosExecutorDriver *---> ExecutorProcess
            |
            V
    CommandExecutor *---> CommandExecutorProcess
            |
            V
          task

My aim was to explain that there is a wrapper around the CommandExecutorProcess which has its own shutdown period. For simplicity I called this wrapper (which is ExecutorProcess and a bit of MesosExecutorDriver) the "base executor". However, it looks like my terminology is not good enough and maybe even misleading. What terms would you suggest, Ben?

IMHO we should just use the terms ExecutorProcess, MesosExecutorDriver, etc. If you want to alias them within that comment then I'd suggest defining the alias (as you've done for me here) and then using that alias in the comment. That being said, my hunch is that you'll get more mileage just using the class names. This will also make renaming/refactoring easier, as code searches will grab these comments too.

- Benjamin

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25434/#review52750

On Sept. 9, 2014, 12:54 p.m., Alexander Rukletsov wrote:

(Updated Sept. 9, 2014, 12:54 p.m.)

Review request for mesos, Niklas Nielsen, Till Toenshoff, and Timothy St. Clair.
Bugs: MESOS-1571 (https://issues.apache.org/jira/browse/MESOS-1571)
Repository: mesos-git

Description: The slave's configurable executor_shutdown_grace_period flag is propagated to Executor and CommandExecutor through an environment variable. The shutdown timeout in Executor and the signal escalation timeout in CommandExecutor are now dependent on this flag. Each nested timeout is somewhat shorter than the parent one.

Diffs:
- src/exec/exec.cpp 36d1778
- src/launcher/executor.cpp 12ac14b
- src/slave/constants.hpp 9030871
- src/slave/constants.cpp e1da5c0
- src/slave/containerizer/containerizer.hpp 8a66412
- src/slave/containerizer/containerizer.cpp 0254679
- src/slave/containerizer/docker.cpp 0febbac
- src/slave/containerizer/external_containerizer.cpp efbc68f
- src/slave/containerizer/mesos/containerizer.cpp 9d08329
- src/slave/flags.hpp 21e0021
- src/tests/containerizer.cpp a17e1e0

Diff: https://reviews.apache.org/r/25434/diff/

Testing: make check (OS X 10.9.4; Ubuntu 14.04 amd64)

Thanks, Alexander Rukletsov
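The "each nested timeout is somewhat shorter than the parent one" scheme from the review description can be sketched roughly as follows. The variable names and the one-second decrement are illustrative assumptions, not the actual Mesos names or values:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the slave's shutdown grace period is handed down
# (in Mesos, via an environment variable), and each nested component
# budgets a somewhat shorter timeout so it can finish before its parent
# gives up on it. Names and the one-second step are illustrative only.

slave_grace=${SLAVE_SHUTDOWN_GRACE_SECS:-5}

# The executor's shutdown timeout is shorter than the slave's deadline...
executor_grace=$(( slave_grace > 1 ? slave_grace - 1 : slave_grace ))

# ...and the signal escalation (SIGTERM -> SIGKILL) timeout nests
# inside the executor's budget, shorter still.
escalation_grace=$(( executor_grace > 1 ? executor_grace - 1 : executor_grace ))

echo "slave=${slave_grace}s executor=${executor_grace}s escalation=${escalation_grace}s"
```

With the default of 5 seconds this yields a 4-second executor timeout and a 3-second escalation timeout, so the SIGKILL fires before the executor deadline, which fires before the slave's.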
Re: Dynamic Resource Roles
BenH has been calling these "master reservations" (globally controlled reservations across all slaves through the master) and "offer reservations" (I don't care which nodes they're on, as long as I get X CPU and Y RAM, or Z sets of {X, Y}), and they're definitely on the roadmap.

On Wed, Sep 10, 2014 at 9:05 AM, Tom Arnfeld t...@duedil.com wrote: [snip]
Re: Review Request 25434: Propagate slave shutdown grace period to Executor and CommandExecutor.
On Sept. 9, 2014, 5:50 p.m., Benjamin Hindman wrote: [snip: the 'base executor' vs. 'command executor' naming discussion, quoted in full]

Benjamin Hindman wrote: [snip: suggestion to just use the class names ExecutorProcess, MesosExecutorDriver, etc.]

Ok, agreed.

- Alexander

This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25434/#review52750

On Sept. 9, 2014, 12:54 p.m., Alexander Rukletsov wrote: [snip: review request 25434 description, diffs, and testing, quoted in full]
Re: Review Request 25508: Fix git clean -xdf skipping leveldb
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/#review52894 --- 3rdparty/Makefile.am https://reviews.apache.org/r/25508/#comment92088 didn't realize that the leveldb we bundle has git files in it! isn't the proper fix here to bundle a proper 'dist'ribution of leveldb instead of its git tree? - Vinod Kone On Sept. 10, 2014, 4:30 p.m., Timothy St. Clair wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/ --- (Updated Sept. 10, 2014, 4:30 p.m.) Review request for mesos, Jie Yu and Vinod Kone. Bugs: MESOS-1764 https://issues.apache.org/jira/browse/MESOS-1764 Repository: mesos-git Description --- Very minor change to allow git clean -xdf to remove the leveldb directory Diffs - 3rdparty/Makefile.am 7cf0c88 Diff: https://reviews.apache.org/r/25508/diff/ Testing --- make check Thanks, Timothy St. Clair
Re: Review Request 25508: Fix git clean -xdf skipping leveldb
On Sept. 10, 2014, 5:26 p.m., Vinod Kone wrote: 3rdparty/Makefile.am, line 84 https://reviews.apache.org/r/25508/diff/1/?file=684613#file684613line84 didn't realize that the leveldb we bundle has git files in it! isn't the proper fix here to bundle a proper 'dist'ribution of leveldb instead of its git tree? You're probably right. I didn't want to rethunk a tarball though. - Timothy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/#review52894 --- On Sept. 10, 2014, 4:30 p.m., Timothy St. Clair wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/ --- (Updated Sept. 10, 2014, 4:30 p.m.) Review request for mesos, Jie Yu and Vinod Kone. Bugs: MESOS-1764 https://issues.apache.org/jira/browse/MESOS-1764 Repository: mesos-git Description --- Very minor change to allow git clean -xdf to remove the leveldb directory Diffs - 3rdparty/Makefile.am 7cf0c88 Diff: https://reviews.apache.org/r/25508/diff/ Testing --- make check Thanks, Timothy St. Clair
Re: Review Request 25439: Fix protobuf detection on systems with Python 3 as default
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25439/#review52898 --- Ship it! Ship It! - Timothy St. Clair On Sept. 9, 2014, 2:49 p.m., Kamil Domanski wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25439/ --- (Updated Sept. 9, 2014, 2:49 p.m.) Review request for mesos and Timothy St. Clair. Bugs: MESOS-1774 https://issues.apache.org/jira/browse/MESOS-1774 Repository: mesos-git Description --- MESOS-1774 Diffs - m4/ac_python_module.m4 8360b65434e3c1912e2b8670f70e4130352a3c92 Diff: https://reviews.apache.org/r/25439/diff/ Testing --- ./configure --disable-bundled Thanks, Kamil Domanski
Re: Review Request 25487: Increased session timeouts for ZooKeeper related tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25487/ --- (Updated Sept. 10, 2014, 11 a.m.) Review request for mesos and Ben Mahler. Changes --- Minor fix per Dominic's review. Bugs: MESOS-1676 https://issues.apache.org/jira/browse/MESOS-1676 Repository: mesos-git Description --- - On slower machines the ZooKeeper C client sometimes times out when we aren't expecting it to, because either the test server or the client is too slow to respond. Increasing this value helps mitigate the problem. - The effect of server->shutdownNetwork() is immediate, so this won't prolong the tests as long as they don't wait for session expiration without clock advances, which I have checked and there are none. Diffs (updated) - src/tests/master_contender_detector_tests.cpp 9ac59aa446a132e734238e0e55801117c4ef31b4 src/tests/zookeeper.cpp e45f956e1486e952a4efeb123e15568518fb53fe Diff: https://reviews.apache.org/r/25487/diff/ Testing --- make check. Thanks, Jiang Yan Xu
Review Request 25511: Pulled the log line in ZooKeeperTestServer::shutdownNetwork() to above the shutdown call.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25511/ --- Review request for mesos and Ben Mahler. Repository: mesos-git Description --- - When debugging ZooKeeper-related tests it's often useful to know when the test is about to shut down the ZK server, in order to reason about the order of events. Otherwise client disconnections are often logged before this shutdown line, which can be confusing. Diffs - src/tests/zookeeper_test_server.cpp a8c9b1cd8a546abdeb4d89a8fe9ebc3b3d577665 Diff: https://reviews.apache.org/r/25511/diff/ Testing --- make check. Thanks, Jiang Yan Xu
Jenkins build is back to normal : mesos-reviewbot #1510
See https://builds.apache.org/job/mesos-reviewbot/1510/
Re: Review Request 25508: Fix git clean -xdf skipping leveldb
On Sept. 10, 2014, 5:26 p.m., Vinod Kone wrote: 3rdparty/Makefile.am, line 84 https://reviews.apache.org/r/25508/diff/1/?file=684613#file684613line84 didn't realize that the leveldb we bundle has git files in it! isn't the proper fix here to bundle a proper 'dist'ribution of leveldb instead of its git tree? Timothy St. Clair wrote: You're probably right. I didn't want to rethunk a tarball though. can you try with replacing the bundled leveldb.tar.gz with this? git archive -o leveldb.tar.gz --prefix=leveldb/ HEAD (# run this from unbundled leveldb clone, e.g., mesos/build/3rdparty/leveldb) once you confirm that works, just have this patch be a replacement of the .tar.gz. sounds good? - Vinod --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/#review52894 --- On Sept. 10, 2014, 4:30 p.m., Timothy St. Clair wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/ --- (Updated Sept. 10, 2014, 4:30 p.m.) Review request for mesos, Jie Yu and Vinod Kone. Bugs: MESOS-1764 https://issues.apache.org/jira/browse/MESOS-1764 Repository: mesos-git Description --- Very minor change to allow git clean -xdf to remove the leveldb directory Diffs - 3rdparty/Makefile.am 7cf0c88 Diff: https://reviews.apache.org/r/25508/diff/ Testing --- make check Thanks, Timothy St. Clair
Re: Review Request 25508: Fix git clean -xdf skipping leveldb
On Sept. 10, 2014, 5:26 p.m., Vinod Kone wrote: 3rdparty/Makefile.am, line 84 https://reviews.apache.org/r/25508/diff/1/?file=684613#file684613line84 didn't realize that the leveldb we bundle has git files in it! isn't the proper fix here to bundle a proper 'dist'ribution of leveldb instead of its git tree? Timothy St. Clair wrote: You're probably right. I didn't want to rethunk a tarball though. Vinod Kone wrote: can you try with replacing the bundled leveldb.tar.gz with this? git archive -o leveldb.tar.gz --prefix=leveldb/ HEAD (# run this from unbundled leveldb clone, e.g., mesos/build/3rdparty/leveldb) once you confirm that works, just have this patch be a replacement of the .tar.gz. sounds good? So if you navigate into the leveldb folder and run:

    git archive --format=tar.gz --prefix=leveldb/ -o leveldb.tar.gz origin/master
    mv -f leveldb.tar.gz ../

It works, but it produces a large binary diff. How do you want to handle this? - Timothy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/#review52894 --- On Sept. 10, 2014, 4:30 p.m., Timothy St. Clair wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/ --- (Updated Sept. 10, 2014, 4:30 p.m.) Review request for mesos, Jie Yu and Vinod Kone. Bugs: MESOS-1764 https://issues.apache.org/jira/browse/MESOS-1764 Repository: mesos-git Description --- Very minor change to allow git clean -xdf to remove the leveldb directory Diffs - 3rdparty/Makefile.am 7cf0c88 Diff: https://reviews.apache.org/r/25508/diff/ Testing --- make check Thanks, Timothy St. Clair
Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/ --- Review request for mesos, Chi Zhang, Vinod Kone, and Cong Wang. Repository: mesos-git Description --- See summary. Since we are not forwarding IPv6 packets, it doesn't make sense to enable IPv6. By disabling IPv6, we avoid spamming the kernel log with warnings about duplicate IPv6 addresses, since all veths have the same MAC address. Diffs - src/slave/containerizer/isolators/network/port_mapping.cpp 938782ae2ab1da34eb316381131e9bfcb7c810d1 Diff: https://reviews.apache.org/r/25512/diff/ Testing --- sudo make check Thanks, Jie Yu
Re: Mesos Driver aborted silently?
My guess is that your driver threw an exception while handling the offerRescinded() callback, which was detected by the JNI binding (IIRC Mantis is a JVM framework?), causing it to abort the driver. Note that when a driver aborts, it will send a DeactivateFrameworkMessage to the master, causing the master to deactivate the framework (but still keep its tasks alive until the framework failover timeout). Having said that, your point regarding the scheduler not being able to detect that the driver is aborted until it makes *another* driver call is true. The driver doesn't call the error() callback when aborted for a couple of reasons: 1) abort() can be called by the scheduler itself, so it doesn't make much sense to send an error() callback, and 2) if abort() is caused by a JVM exception, the scheduler probably already knows of it (I'm guessing this wasn't the case for Mantis?). Perhaps these semantics are worth reconsidering.

On Tue, Sep 9, 2014 at 3:14 PM, Sharma Podila spod...@netflix.com wrote: We had this problem show up yesterday, just one time, that I don't understand. Would appreciate any help. This is the sequence of events, as far as I can tell:

From the framework's perspective:
F1: framework got an offer from a host that it decided it will not use, so it declined it
F2: got a scheduler callback about the offer being rescinded (I believe the same host that I just declined; the host was terminated by a separate decom process)
F3: calling the Mesos driver to kill a task shows driver status as DRIVER_ABORTED. However, there was no scheduler callback to reflect this. Wouldn't the scheduler be told about the driver being aborted via one of disconnected(), error(), etc.?

From the Mesos master's perspective:
M1: failed to validate offer (must be in response to F1)
M2: deactivating framework

I am thinking that F1 was initiated by the framework before that slave went down. But the slave went down and the offer was rescinded in Mesos before F1 was received by the Mesos master, which resulted in M1.
Which should be OK, I'd imagine. But here are two things I can't understand:

1. Why was the framework deactivated? I looked in the Mesos logs and only found the below lines of interest.
2. Why was the framework not notified about being deactivated, even though using the driver shows status as DRIVER_ABORTED?
2.1 Are frameworks required to periodically check the status of the driver via mechanisms other than the scheduler callback? If so, what are they?

As I said, this happened only once and is likely a race condition of sorts. I can't reproduce it. This sequence of events happens routinely, but this error happened only once. It is nasty since the framework then just sits there with no offers and therefore no tasks get scheduled. We're on Mesos 0.18.0 (if this is specifically addressed in 0.19 or 0.20, that'd be good to know). I remember there was a reference to a problem caused when the created Mesos driver gets GC'ed. However, our driver reference never goes out of scope. I have the following relevant logs from the framework and the Mesos master. The timestamps in the logs are from the same clock (on the same machine). From MantisMaster: 2014-09-08 20:08:46,263 WARN Thread-42 MesosSchedulerCallbackHandler - Declining offer from host 10.200.13.87 due to missing attribute value for EC2_AMI_ID - expecting [ami-5e6bc836] got [ami-28d47740] 2014-09-08 20:08:46,271 WARN Thread-58 MesosSchedulerCallbackHandler - Offer rescinded: offerID=20140908-195444-2298791946-7103-5698-5 .
2014-09-08 20:11:31,322 INFO pool-27-thread-1 VirtualMachineMasterServiceMesosImpl - Calling mesos to kill outliers-5-worker-0-7 2014-09-08 20:11:31,322 INFO pool-27-thread-1 VirtualMachineMasterServiceMesosImpl - Kill status = DRIVER_ABORTED From Mesos-Master: W0908 20:08:46.277575 5791 master.cpp:1556] Failed to validate offer 20140908-195444-2298791946-7103-5698-5 : Offer 20140908-195444-2298791946-7103-5698-5 is no longer valid I0908 20:08:46.277721 5791 master.cpp:1079] Deactivating framework MantisFramework I0908 20:08:46.278017 5789 hierarchical_allocator_process.hpp:408] Deactivated framework MantisFramework
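The abort semantics discussed in this thread can be modeled with a minimal sketch. The Status values mirror mesos::Status, but FakeDriver and its methods are illustrative stand-ins, not libmesos or Mantis code:

```cpp
#include <cassert>
#include <string>

// Minimal model of the semantics above: an abort triggered inside a
// callback is only observed by the scheduler on its *next* driver call.
enum Status { DRIVER_RUNNING, DRIVER_ABORTED };

struct FakeDriver {
  Status status = DRIVER_RUNNING;

  // An exception escaping a scheduler callback makes the (JNI) binding
  // abort the driver; no error() callback is delivered at that point.
  void offerRescinded(bool callbackThrew) {
    if (callbackThrew) {
      status = DRIVER_ABORTED;
    }
  }

  // Every driver call returns the driver's current status, so the
  // scheduler first learns of an earlier abort when it makes another
  // call, e.g. killTask().
  Status killTask(const std::string& /* taskId */) {
    return status;  // An aborted driver silently drops the request.
  }
};
```

This matches the observed sequence: the abort happens around the offerRescinded() handling, and the framework only sees DRIVER_ABORTED when it later calls killTask().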
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52913 --- Ship it! Have you confirmed/tested that this is safe? - Vinod Kone On Sept. 10, 2014, 6:26 p.m., Jie Yu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/ --- (Updated Sept. 10, 2014, 6:26 p.m.) Review request for mesos, Chi Zhang, Vinod Kone, and Cong Wang. Repository: mesos-git Description --- See summary. Since we are not forwarding IPv6 packets, it doesn't make sense to enable ipv6. By disabling IPv6, we won't get spamming kernel log warning duplicated IPv6 addresses since all veth have the same mac. Diffs - src/slave/containerizer/isolators/network/port_mapping.cpp 938782ae2ab1da34eb316381131e9bfcb7c810d1 Diff: https://reviews.apache.org/r/25512/diff/ Testing --- sudo make check Thanks, Jie Yu
Re: Review Request 25261: Check for variadic template and default/deleted function support
On Sept. 2, 2014, 7:50 p.m., Michael Park wrote: Just something to note here, there's a bug in earlier GCC versions where the access control of `= default`ed functions isn't enforced correctly. e.g.

```
class Foo
{
private:
  Foo() = default;
};

class Bar
{
private:
  Bar() {}
};

int main()
{
  Foo foo;   // Foo::Foo() is private but not enforced.
  // Bar bar;  // error: 'Bar::Bar()' is private.
}
```

The above code snippet compiles fine with GCC 4.6. Does it work correctly for gcc-4.4? If yes, it should be fine. - Vinod --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25261/#review52067 --- On Sept. 2, 2014, 5:57 p.m., Dominic Hamon wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25261/ --- (Updated Sept. 2, 2014, 5:57 p.m.) Review request for mesos and Benjamin Hindman. Bugs: MESOS-1752 and MESOS-1753 https://issues.apache.org/jira/browse/MESOS-1752 https://issues.apache.org/jira/browse/MESOS-1753 Repository: mesos-git Description --- Add C++11 language features to the m4 macro that checks for C++11 support. Diffs - m4/ax_cxx_compile_stdcxx_11.m4 07b298f151094e818287f741b3e0efd28374e82b Diff: https://reviews.apache.org/r/25261/diff/ Testing --- built with g++-4.4, the minimum compiler we support. Thanks, Dominic Hamon
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52918 --- Ship it! Maybe also check whether /proc/sys/net/ipv6/conf/all/disable_ipv6 exists in the child script, since you did so outside? - Cong Wang On Sept. 10, 2014, 6:26 p.m., Jie Yu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/ --- (Updated Sept. 10, 2014, 6:26 p.m.) Review request for mesos, Chi Zhang, Vinod Kone, and Cong Wang. Repository: mesos-git Description --- See summary. Since we are not forwarding IPv6 packets, it doesn't make sense to enable ipv6. By disabling IPv6, we won't get spamming kernel log warning duplicated IPv6 addresses since all veth have the same mac. Diffs - src/slave/containerizer/isolators/network/port_mapping.cpp 938782ae2ab1da34eb316381131e9bfcb7c810d1 Diff: https://reviews.apache.org/r/25512/diff/ Testing --- sudo make check Thanks, Jie Yu
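The existence check suggested here could look roughly like the sketch below. The conf directory is a parameter purely so the logic can be exercised outside /proc; the real isolator would target /proc/sys/net/ipv6/conf/all/disable_ipv6, and this is an illustration, not the actual port_mapping code:

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Only touch the ipv6 sysctl when it exists: kernels built without
// IPv6 support do not expose the knob at all.
bool disableIPv6(const std::string& confDir) {
  const std::string path = confDir + "/disable_ipv6";

  // Existence check: opening for read fails if the knob is absent.
  std::ifstream probe(path.c_str());
  if (!probe.good()) {
    return false;  // IPv6 not compiled into this kernel; nothing to do.
  }
  probe.close();

  // Writing "1" disables IPv6 for the interfaces covered by confDir.
  std::ofstream knob(path.c_str());
  knob << "1";
  return knob.good();
}
```

Performing the same check in the child script (as suggested) avoids failing container launches on IPv6-less kernels.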
Re: Review Request 25508: Fix git clean -xdf skipping leveldb
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/#review52919 --- Ship it! So for completeness: the actual diff is not posted here b/c it is a binary blob (re: comments), but it was the result of the following:

    cd 3rdparty
    tar -xzf leveldb.tar.gz
    cd leveldb
    git archive --format=tar.gz --prefix=leveldb/ -o leveldb.tar.gz origin/master
    mv leveldb.tar.gz ../
    cd ..
    rm -rf leveldb

Then proceed to test the make mechanics as before. git clean -xdf now removes the leveldb subdir. - Timothy St. Clair On Sept. 10, 2014, 4:30 p.m., Timothy St. Clair wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25508/ --- (Updated Sept. 10, 2014, 4:30 p.m.) Review request for mesos, Jie Yu and Vinod Kone. Bugs: MESOS-1764 https://issues.apache.org/jira/browse/MESOS-1764 Repository: mesos-git Description --- Very minor change to allow git clean -xdf to remove the leveldb directory Diffs - 3rdparty/Makefile.am 7cf0c88 Diff: https://reviews.apache.org/r/25508/diff/ Testing --- make check Thanks, Timothy St. Clair
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52920 --- Does this mean that users that open sockets (without specifying a family) will only get a v4 socket? What happens if they try to open a v6 socket? - Ian Downes On Sept. 10, 2014, 11:26 a.m., Jie Yu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/ --- (Updated Sept. 10, 2014, 11:26 a.m.) Review request for mesos, Chi Zhang, Vinod Kone, and Cong Wang. Repository: mesos-git Description --- See summary. Since we are not forwarding IPv6 packets, it doesn't make sense to enable ipv6. By disabling IPv6, we won't get spamming kernel log warning duplicated IPv6 addresses since all veth have the same mac. Diffs - src/slave/containerizer/isolators/network/port_mapping.cpp 938782ae2ab1da34eb316381131e9bfcb7c810d1 Diff: https://reviews.apache.org/r/25512/diff/ Testing --- sudo make check Thanks, Jie Yu
Build failed in Jenkins: Mesos-Trunk-Ubuntu-Build-In-Src-Set-JAVA_HOME #2096
hostname: penates.apache.org I0910 19:26:48.788871 16941 slave.cpp:316] Slave checkpoint: false I0910 19:26:48.789297 16945 state.cpp:33] Recovering state from '/tmp/GarbageCollectorIntegrationTest_DiskUsage_rWhuy4/meta' I0910 19:26:48.789433 16945 status_update_manager.cpp:193] Recovering status update manager I0910 19:26:48.789624 16939 slave.cpp:3202] Finished recovery I0910 19:26:48.789911 16937 slave.cpp:598] New master detected at master@67.195.81.186:41538 I0910 19:26:48.789952 16937 slave.cpp:672] Authenticating with master master@67.195.81.186:41538 I0910 19:26:48.789994 16948 status_update_manager.cpp:167] New master detected at master@67.195.81.186:41538 I0910 19:26:48.790019 16937 slave.cpp:645] Detecting new master I0910 19:26:48.790046 16936 authenticatee.hpp:128] Creating new client SASL connection I0910 19:26:48.922570 16936 master.cpp:3653] Authenticating slave(206)@67.195.81.186:41538 I0910 19:26:48.922710 16934 authenticator.hpp:156] Creating new server SASL connection I0910 19:26:48.922807 16934 authenticatee.hpp:219] Received SASL authentication mechanisms: CRAM-MD5 I0910 19:26:48.922827 16934 authenticatee.hpp:245] Attempting to authenticate with mechanism 'CRAM-MD5' I0910 19:26:48.922914 16940 authenticator.hpp:262] Received SASL authentication start I0910 19:26:48.922978 16940 authenticator.hpp:384] Authentication requires more steps I0910 19:26:48.923027 16940 authenticatee.hpp:265] Received SASL authentication step I0910 19:26:48.923100 16948 authenticator.hpp:290] Received SASL authentication step I0910 19:26:48.923125 16948 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'penates.apache.org' server FQDN: 'penates.apache.org' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: false I0910 19:26:48.923135 16948 auxprop.cpp:153] Looking up auxiliary property '*userPassword' I0910 19:26:48.923147 16948 auxprop.cpp:153] Looking up auxiliary property '*cmusaslsecretCRAM-MD5' 
I0910 19:26:48.923159 16948 auxprop.cpp:81] Request to lookup properties for user: 'test-principal' realm: 'penates.apache.org' server FQDN: 'penates.apache.org' SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false SASL_AUXPROP_AUTHZID: true I0910 19:26:48.923168 16948 auxprop.cpp:103] Skipping auxiliary property '*userPassword' since SASL_AUXPROP_AUTHZID == true I0910 19:26:48.923177 16948 auxprop.cpp:103] Skipping auxiliary property '*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true I0910 19:26:48.923192 16948 authenticator.hpp:376] Authentication success I0910 19:26:48.923270 16947 authenticatee.hpp:305] Authentication success I0910 19:26:48.923288 16937 master.cpp:3693] Successfully authenticated principal 'test-principal' at slave(206)@67.195.81.186:41538 I0910 19:26:48.923444 16947 slave.cpp:729] Successfully authenticated with master master@67.195.81.186:41538 I0910 19:26:48.923501 16947 slave.cpp:980] Will retry registration in 7.844963ms if necessary I0910 19:26:48.923569 16946 master.cpp:2843] Registering slave at slave(206)@67.195.81.186:41538 (penates.apache.org) with id 20140910-192648-3125920579-41538-16920-0 I0910 19:26:48.923704 16937 registrar.cpp:422] Attempting to update the 'registry' I0910 19:26:48.925449 16943 log.cpp:680] Attempting to append 337 bytes to the log I0910 19:26:48.925525 16940 coordinator.cpp:340] Coordinator attempting to write APPEND action at position 3 I0910 19:26:48.925945 16942 replica.cpp:508] Replica received write request for position 3 I0910 19:26:48.926174 16942 leveldb.cpp:343] Persisting action (356 bytes) to leveldb took 207163ns I0910 19:26:48.926193 16942 replica.cpp:676] Persisted action at 3 I0910 19:26:48.926488 16939 replica.cpp:655] Replica received learned notice for position 3 I0910 19:26:48.926950 16939 leveldb.cpp:343] Persisting action (358 bytes) to leveldb took 437632ns I0910 19:26:48.926970 16939 replica.cpp:676] Persisted action at 3 I0910 19:26:48.926980 16939 
replica.cpp:661] Replica learned APPEND action at position 3 I0910 19:26:48.927336 16949 registrar.cpp:479] Successfully updated 'registry' I0910 19:26:48.927433 16935 log.cpp:699] Attempting to truncate the log to 3 I0910 19:26:48.927454 16948 master.cpp:2883] Registered slave 20140910-192648-3125920579-41538-16920-0 at slave(206)@67.195.81.186:41538 (penates.apache.org) I0910 19:26:48.927476 16948 master.cpp:4126] Adding slave 20140910-192648-3125920579-41538-16920-0 at slave(206)@67.195.81.186:41538 (penates.apache.org) with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] I0910 19:26:48.927518 16947 coordinator.cpp:340] Coordinator attempting to write TRUNCATE action at position 4 I0910 19:26:48.927639 16940 slave.cpp:763] Registered with master master@67.195.81.186:41538; given slave ID 20140910-192648-3125920579-41538-16920-0 I0910 19:26:48.927705 16940 slave.cpp:2329] Received ping from slave-observer(184)@67.195.81.186:41538 I0910 19:26
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52923 --- Ship it! Agreed; check to make sure this works in dev clusters and that the kernel warning messages go away, if that hasn't been done already. - Chi Zhang On Sept. 10, 2014, 6:26 p.m., Jie Yu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/ --- (Updated Sept. 10, 2014, 6:26 p.m.) Review request for mesos, Chi Zhang, Vinod Kone, and Cong Wang. Repository: mesos-git Description --- See summary. Since we are not forwarding IPv6 packets, it doesn't make sense to enable ipv6. By disabling IPv6, we won't get spamming kernel log warning duplicated IPv6 addresses since all veth have the same mac. Diffs - src/slave/containerizer/isolators/network/port_mapping.cpp 938782ae2ab1da34eb316381131e9bfcb7c810d1 Diff: https://reviews.apache.org/r/25512/diff/ Testing --- sudo make check Thanks, Jie Yu
Review Request 25516: Fixed authorization tests to properly deal with registration retries.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25516/ --- Review request for mesos and Jiang Yan Xu. Bugs: MESOS-1760 and MESOS-1766 https://issues.apache.org/jira/browse/MESOS-1760 https://issues.apache.org/jira/browse/MESOS-1766 Repository: mesos-git Description --- Since the authorization tests do not control the retry behavior of the scheduler driver, it is possible for the driver to retry registrations and thus 'register_framework' authorizations. The MockAuthorizer needs to account for this by allowing all subsequent authorization attempts. Diffs - src/tests/master_authorization_tests.cpp b9aa7bf4f53e414d84f8cf4e020a645db8e5d855 Diff: https://reviews.apache.org/r/25516/diff/ Testing --- make check Thanks, Vinod Kone
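The idea behind the fix can be illustrated without gmock. In gmock terms it corresponds to following a WillOnce(...) expectation with WillRepeatedly(...); FakeAuthorizer below is a hypothetical stand-in for the MockAuthorizer mentioned in the description:

```cpp
#include <cassert>

// Because the scheduler driver may retry registration, the master can
// ask to authorize 'register_framework' more than once. A test
// authorizer must therefore allow every subsequent attempt instead of
// expecting exactly one call.
struct FakeAuthorizer {
  int attempts = 0;

  bool authorizeRegisterFramework() {
    ++attempts;
    return true;  // Allow the first and all retried attempts alike.
  }
};
```

With a strict single-call expectation, a slow test machine that triggers a registration retry would fail the test spuriously, which is exactly the flakiness this patch addresses.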
Build failed in Jenkins: mesos-reviewbot #1511
See https://builds.apache.org/job/mesos-reviewbot/1511/changes Changes: [tstclair] Fix protobuf detection on systems with Python 3 as default (part2) -- [...truncated 5420 lines...] rm -f slave/containerizer/mesos/.dirstamp rm -f sasl/*.o rm -f state/.deps/.dirstamp rm -f sasl/*.lo rm -f state/.dirstamp rm -f sched/*.o rm -f tests/.deps/.dirstamp rm -f sched/*.lo rm -f tests/.dirstamp rm -f scheduler/*.o rm -f tests/common/.deps/.dirstamp rm -f scheduler/*.lo rm -f tests/common/.dirstamp rm -f slave/*.o rm -f usage/.deps/.dirstamp rm -f usage/.dirstamp rm -f zookeeper/.deps/.dirstamp rm -f zookeeper/.dirstamp rm -f slave/*.lo rm -f slave/containerizer/*.o rm -f slave/containerizer/*.lo rm -f slave/containerizer/isolators/cgroups/*.o rm -f slave/containerizer/isolators/cgroups/*.lo rm -f slave/containerizer/isolators/network/*.o rm -f slave/containerizer/isolators/network/*.lo rm -f slave/containerizer/mesos/*.o rm -f slave/containerizer/mesos/*.lo rm -f state/*.o rm -f state/*.lo rm -f tests/*.o rm -f tests/common/*.o rm -f usage/*.o rm -f usage/*.lo rm -f zookeeper/*.o rm -f zookeeper/*.lo rm -rf authorizer/.libs authorizer/_libs rm -rf common/.libs common/_libs rm -rf containerizer/.libs containerizer/_libs rm -rf docker/.libs docker/_libs rm -rf exec/.libs exec/_libs rm -rf files/.libs files/_libs rm -rf java/jni/.libs java/jni/_libs rm -rf jvm/.libs jvm/_libs rm -rf jvm/org/apache/.libs jvm/org/apache/_libs rm -rf linux/.libs linux/_libs rm -rf linux/routing/.libs linux/routing/_libs rm -rf linux/routing/filter/.libs linux/routing/filter/_libs rm -rf linux/routing/link/.libs linux/routing/link/_libs rm -rf linux/routing/queueing/.libs linux/routing/queueing/_libs rm -rf local/.libs local/_libs rm -rf log/.libs log/_libs rm -rf log/tool/.libs log/tool/_libs rm -rf logging/.libs logging/_libs rm -rf master/.libs master/_libs rm -rf messages/.libs messages/_libs rm -rf sasl/.libs sasl/_libs rm -rf sched/.libs sched/_libs rm -rf scheduler/.libs scheduler/_libs rm 
-rf slave/.libs slave/_libs rm -rf slave/containerizer/.libs slave/containerizer/_libs rm -rf slave/containerizer/isolators/cgroups/.libs slave/containerizer/isolators/cgroups/_libs rm -rf slave/containerizer/isolators/network/.libs slave/containerizer/isolators/network/_libs rm -rf slave/containerizer/mesos/.libs slave/containerizer/mesos/_libs rm -rf state/.libs state/_libs rm -rf usage/.libs usage/_libs rm -rf zookeeper/.libs zookeeper/_libs rm -rf ./.deps authorizer/.deps cli/.deps common/.deps containerizer/.deps docker/.deps examples/.deps exec/.deps files/.deps health-check/.deps java/jni/.deps jvm/.deps jvm/org/apache/.deps launcher/.deps linux/.deps linux/routing/.deps linux/routing/filter/.deps linux/routing/link/.deps linux/routing/queueing/.deps local/.deps log/.deps log/tool/.deps logging/.deps master/.deps messages/.deps sasl/.deps sched/.deps scheduler/.deps slave/.deps slave/containerizer/.deps slave/containerizer/isolators/cgroups/.deps slave/containerizer/isolators/network/.deps slave/containerizer/mesos/.deps state/.deps tests/.deps tests/common/.deps usage/.deps zookeeper/.deps rm -f Makefile make[2]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build/src' Making distclean in ec2 make[2]: Entering directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build/ec2' rm -rf .libs _libs rm -f *.lo test -z || rm -f test . = ../../ec2 || test -z || rm -f rm -f Makefile make[2]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build/ec2' rm -f config.status config.cache config.log configure.lineno config.status.lineno rm -f Makefile make[1]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build' if test -d mesos-0.21.0; then find mesos-0.21.0 -type d ! 
-perm -200 -exec chmod u+w {} ';' rm -rf mesos-0.21.0 || { sleep 5 rm -rf mesos-0.21.0; }; else :; fi == mesos-0.21.0 archives ready for distribution: mesos-0.21.0.tar.gz == real73m32.497s user142m31.486s sys 7m54.482s + chmod -R +w 3rdparty CHANGELOG Doxyfile LICENSE Makefile Makefile.am Makefile.in NOTICE README.md aclocal.m4 ar-lib autom4te.cache bin bootstrap compile config.guess config.log config.lt config.status config.sub configure configure.ac depcomp docs ec2 frameworks include install-sh libtool ltmain.sh m4 mesos-0.21.0.tar.gz mesos.pc mesos.pc.in missing mpi src support + git clean -fdx Removing .libs/ Removing 3rdparty/Makefile Removing 3rdparty/Makefile.in Removing 3rdparty/libprocess/.deps/ Removing 3rdparty/libprocess/3rdparty/.deps/ Removing 3rdparty/libprocess/3rdparty/Makefile Removing 3rdparty/libprocess/3rdparty/Makefile.in Removing 3rdparty/libprocess/3rdparty/gmock_sources.cc Removing 3rdparty/libprocess/3rdparty/stout/Makefile Removing
Jenkins build is back to normal : Mesos-Trunk-Ubuntu-Build-In-Src-Set-JAVA_HOME #2097
See https://builds.apache.org/job/Mesos-Trunk-Ubuntu-Build-In-Src-Set-JAVA_HOME/2097/changes
Design doc for updating FrameworkInfo
Hi folks, We have a design doc up (attached to MESOS-1784 https://issues.apache.org/jira/browse/MESOS-1784) for properly updating the FrameworkInfo. The basic idea is to give frameworks the ability to update any fields of their FrameworkInfo (e.g., 'user', 'failover_timeout') without having to restart masters/slaves/tasks/executors. Feel free to provide feedback on the doc or the ticket. Thanks, Vinod
Re: Review Request 25035: Fix for MESOS-1688
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/ --- (Updated Sept. 10, 2014, 10 p.m.) Review request for mesos and Vinod Kone. Changes --- fixed review issues Bugs: MESOS-1688 https://issues.apache.org/jira/browse/MESOS-1688 Repository: mesos-git Description --- As already explained in JIRA MESOS-1688, there are schedulers that allocate memory only for the executor and not for tasks. For tasks, only CPU resources are allocated in this case. Such a scheduler does not get offered any idle CPUs if the slave has nearly used up all its memory. This can easily lead to a deadlock (in the application, not in Mesos). Simple example:

1. Scheduler allocates all memory of a slave for an executor.
2. Scheduler launches a task for this executor (allocating 1 CPU).
3. Task finishes: 1 CPU, 0 MB memory allocatable.
4. No offers are made, as no memory is left. Scheduler will wait for offers forever. Deadlock in the application.

To fix this problem, offers must be made if CPU resources are allocatable, without considering allocatable memory. Diffs (updated) - src/common/resources.cpp edf36b1 src/master/constants.cpp faa1503 src/master/hierarchical_allocator_process.hpp 34f8cd6 src/master/master.cpp 18464ba src/tests/allocator_tests.cpp 774528a Diff: https://reviews.apache.org/r/25035/diff/ Testing --- Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in fine-grained mode to saturate allocatable memory. The jobs run fine now. This load always caused a deadlock in all Spark jobs within one minute with the unpatched Mesos. Thanks, Martin Weindel
Re: Review Request 25035: Fix for MESOS-1688
On Sept. 9, 2014, 7:10 p.m., Vinod Kone wrote: src/master/master.cpp, line 1901 https://reviews.apache.org/r/25035/diff/4/?file=682182#file682182line1901 I like these warnings. Are you planning to get this into 0.20.1 or 0.21.0? If the former, can you add this to the list of deprecations in the CHANGELOG. Would be nice to see this in 0.20.1. But it is not clear to me how to update the CHANGELOG. There is no section for upcoming releases. - Martin --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/#review52763 --- On Sept. 10, 2014, 10 p.m., Martin Weindel wrote: [...]
Re: Review Request 25516: Fixed authorization tests to properly deal with registration retries.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25516/#review52960 --- Ship it! Ship It! - Jiang Yan Xu On Sept. 10, 2014, 12:55 p.m., Vinod Kone wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25516/ --- (Updated Sept. 10, 2014, 12:55 p.m.) Review request for mesos and Jiang Yan Xu. Bugs: MESOS-1760 and MESOS-1766 https://issues.apache.org/jira/browse/MESOS-1760 https://issues.apache.org/jira/browse/MESOS-1766 Repository: mesos-git Description --- Since the authorization tests do not control the retry behavior of the scheduler driver, it is possible for the driver to retry registrations and thus 'register_framework' authorizations. The MockAuthorizer needs to account for this by allowing all subsequent authorization attempts. Diffs - src/tests/master_authorization_tests.cpp b9aa7bf4f53e414d84f8cf4e020a645db8e5d855 Diff: https://reviews.apache.org/r/25516/diff/ Testing --- make check Thanks, Vinod Kone
Completed tasks remain in TASK_RUNNING when framework is disconnected
Hi guys, We have run into a problem that causes tasks which complete while a framework is disconnected (and has a failover time) to remain in a running state, even though the tasks actually finish. Here is a test framework we have been able to reproduce the issue with: https://gist.github.com/nqn/9b9b1de9123a6e836f54 It launches many short-lived tasks (1 second sleep), and when killing the framework instance, the master reports the tasks as running even after several minutes: http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png When clicking on one of the slaves where, for example, task 49 runs, the slave knows that it completed: http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png The tasks only finish when the framework connects again (which it may never do). This is on Mesos 0.20.0, but also applies to HEAD (as of today). Do you guys have any insights into what may be going on here? Is this by design or a bug? Thanks, Niklas
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
On Sept. 10, 2014, 6:48 p.m., Vinod Kone wrote: Have you confirmed/tested that this is safe? Tested. - Jie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52913 --- On Sept. 10, 2014, 6:26 p.m., Jie Yu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/ --- (Updated Sept. 10, 2014, 6:26 p.m.) Review request for mesos, Chi Zhang, Vinod Kone, and Cong Wang. Repository: mesos-git Description --- See summary. Since we are not forwarding IPv6 packets, it doesn't make sense to enable IPv6. By disabling IPv6, we avoid the kernel log being spammed with warnings about duplicate IPv6 addresses, since all the veths have the same MAC. Diffs - src/slave/containerizer/isolators/network/port_mapping.cpp 938782ae2ab1da34eb316381131e9bfcb7c810d1 Diff: https://reviews.apache.org/r/25512/diff/ Testing --- sudo make check Thanks, Jie Yu
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
On Sept. 10, 2014, 7:06 p.m., Cong Wang wrote: Maybe check if /proc/sys/net/ipv6/conf/all/disable_ipv6 exists in the child script too, since you did outside? It's OK; if the proc file does not exist, it'll be a no-op, as we don't use set -e. - Jie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52918 --- On Sept. 10, 2014, 6:26 p.m., Jie Yu wrote: [...]
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
On Sept. 10, 2014, 7:11 p.m., Ian Downes wrote: Does this mean that users that open sockets (without specifying) will only get a v4 socket? What happens if they try to open a v6 socket? Tested. If they open a v6 socket, IPv4 will be used for communication (unless they use IPV6_ONLY). - Jie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52920 --- On Sept. 10, 2014, 6:26 p.m., Jie Yu wrote: [...]
Re: Review Request 25512: Made sure IPv6 is disabled for port mapping network isolator.
On Sept. 10, 2014, 7:40 p.m., Chi Zhang wrote: Agreed; let's check to make sure this works in the dev clusters and that the kernel warning messages go away, if that hasn't been done. The kernel log no longer has that warning after this change. - Jie --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25512/#review52923 --- On Sept. 10, 2014, 6:26 p.m., Jie Yu wrote: [...]
Re: Completed tasks remain in TASK_RUNNING when framework is disconnected
Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0 (Here tasks 10 to 19 are stuck in the running state). There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d At first glance, it looks like the task can't be found when trying to forward the finish update, because the running update never got acknowledged before the framework disconnected. I may be missing something here. Niklas On 10 September 2014 16:09, Niklas Nielsen nik...@mesosphere.io wrote: [...]
Build failed in Jenkins: mesos-reviewbot #1512
See https://builds.apache.org/job/mesos-reviewbot/1512/changes Changes: [bmahler] Send pending tasks during re-registration. [bmahler] Made the GarbageCollector injectable into the Slave. [bmahler] Added a test for sending pending tasks during re-registration. [tstclair] Fix git clean -xdf skipping leveldb, removes internal .git dirs -- [...truncated 5607 lines of `make distclean` and `git clean -fdx` output; build time: real 74m57.207s, user 144m19.003s, sys 7m59.166s...]
Re: Completed tasks remain in TASK_RUNNING when framework is disconnected
What you observed is expected, because of the way the slave (specifically, the status update manager) operates. The status update manager only sends the next update for a task once the previous update (if one exists) has been acked. In your case, since TASK_RUNNING was not acked by the framework, the master doesn't know about the TASK_FINISHED update that is queued up by the status update manager. If the framework never comes back, i.e., the failover timeout elapses, the master shuts down the framework, which releases those resources. On Wed, Sep 10, 2014 at 4:43 PM, Niklas Nielsen nik...@mesosphere.io wrote: [...]
Review Request 25523: Add Docker pull to docker abstraction
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25523/ --- Review request for mesos and Benjamin Hindman. Repository: mesos-git Description --- Add Docker pull to docker abstraction Diffs - src/docker/docker.hpp e7adedb93272209231a3a9aefecfd6ccc7802ff5 src/docker/docker.cpp af51ac9058382aede61b09e06e312ad2ce6de03e src/slave/containerizer/docker.cpp 0febbac5df4126f6c8d9a06dd0ba1668d041b34a src/tests/docker_tests.cpp 826a8c1ef1b3089d416e5775fa2cf4e5cb0c26d1 Diff: https://reviews.apache.org/r/25523/diff/ Testing --- make check Thanks, Timothy Chen
Re: Review Request 25403: Override entrypoint when shell enabled in Docker
On Sept. 9, 2014, 6:50 p.m., Benjamin Hindman wrote: src/docker/docker.cpp, line 337 https://reviews.apache.org/r/25403/diff/1/?file=680701#file680701line337 Why not move this up above as well? The Docker CLI --entrypoint only allows you to put in a single string, but we actually need an array of entrypoint entries (which is what docker inspect returns). I tried --entrypoint=/bin/sh -c on the CLI and it immediately failed. Therefore, I have to run this on the CLI: docker run --entrypoint=/bin/sh busybox -c ls - Timothy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25403/#review52767 --- On Sept. 5, 2014, 10:13 p.m., Timothy Chen wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25403/ --- (Updated Sept. 5, 2014, 10:13 p.m.) Review request for mesos, Benjamin Hindman and Jie Yu. Bugs: MESOS-1770 https://issues.apache.org/jira/browse/MESOS-1770 Repository: mesos-git Description --- Override entrypoint when shell enabled in Docker Diffs - src/docker/docker.cpp af51ac9058382aede61b09e06e312ad2ce6de03e Diff: https://reviews.apache.org/r/25403/diff/ Testing --- make check Thanks, Timothy Chen
Re: Review Request 25403: Override entrypoint when shell enabled in Docker
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25403/ --- (Updated Sept. 11, 2014, 12:40 a.m.) Review request for mesos, Benjamin Hindman and Jie Yu. Bugs: MESOS-1770 https://issues.apache.org/jira/browse/MESOS-1770 Repository: mesos-git Description --- Review: https://reviews.apache.org/r/25403 Diffs - src/docker/docker.cpp af51ac9058382aede61b09e06e312ad2ce6de03e Diff: https://reviews.apache.org/r/25403/diff/ Testing --- make check Thanks, Timothy Chen
Re: Review Request 25403: Override entrypoint when shell enabled in Docker
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25403/ --- (Updated Sept. 11, 2014, 12:40 a.m.) Review request for mesos, Benjamin Hindman and Jie Yu. Bugs: MESOS-1770 https://issues.apache.org/jira/browse/MESOS-1770 Repository: mesos-git Description (updated) --- Review: https://reviews.apache.org/r/25403 Diffs (updated) - src/docker/docker.cpp af51ac9058382aede61b09e06e312ad2ce6de03e Diff: https://reviews.apache.org/r/25403/diff/ Testing --- make check Thanks, Timothy Chen
Re: Review Request 24776: Add docker containerizer destroy tests
On Sept. 9, 2014, 6:15 p.m., Benjamin Hindman wrote: Why did you need to mock DockerContainerizerProcess in order to write these tests? Couldn't you have just used the existing MockDockerContainerizer? I wanted to simulate having destroy called in a pull/fetching state, so I thought the only way to do so was to mock the process, since the callbacks are on DockerContainerizerProcess and not the Containerizer. With the fetch and pull callbacks blocked, I can call destroy in that state and verify it was able to destroy. - Timothy --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24776/#review52758 --- On Aug. 16, 2014, 10:23 p.m., Timothy Chen wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24776/ --- (Updated Aug. 16, 2014, 10:23 p.m.) Review request for mesos, Benjamin Hindman and Jie Yu. Repository: mesos-git Description --- Add docker containerizer destroy tests Diffs - src/slave/containerizer/docker.hpp fbbd45d77e5f2f74ca893552f85eb893b3dd948f src/slave/containerizer/docker.cpp fe5b29167811d4ac2fe29070c70a04f84093a6ff src/tests/docker_containerizer_tests.cpp 8654f9c787bd207f6a7b821651e0c083bea9dc8a Diff: https://reviews.apache.org/r/24776/diff/ Testing --- make check Thanks, Timothy Chen
Re: Completed tasks remain in TASK_RUNNING when framework is disconnected
The main reason is to keep the status update manager simple. Also, it is very easy to enforce the order of updates to the master/framework in this model. If we allow multiple updates for a task to be in flight, it's really hard (impossible?) to ensure that we are not delivering out-of-order updates, even in edge cases (failover, network partitions, etc.). On Wed, Sep 10, 2014 at 5:35 PM, Niklas Nielsen nik...@mesosphere.io wrote: Hey Vinod - thanks for chiming in! Is there a particular reason for only having one status in flight? Or to put it another way, isn't that too strict a behavior, given that the master state could present the most recent known state if the status update manager tried to send more than the front of the stream? Given very long timeouts, just waiting for those to disappear seems a bit tedious and hogs the cluster. Niklas On 10 September 2014 17:18, Vinod Kone vinodk...@gmail.com wrote: [...]
Re: Completed tasks remain in TASK_RUNNING when framework is disconnected
I agree with Niklas that if the executor has sent a terminal status update to the slave, then the task is done and the master should be able to recover those resources. Only sending the oldest status update to the master, especially in the case of framework failover, prevents these resources from being recovered in a timely manner. I see a couple of options for getting around this, each with its own disadvantages.

1) Send the entire status update stream to the master. Once the master sees the terminal status update, it will removeTask and recover the resources. Future resends of the update will be forwarded to the scheduler, but the master will ignore the subsequent updates (with a warning and invalid_update++ metrics) as far as its own state for the removed task is concerned. Disadvantage 1: Potentially sends a lot of status update messages until the scheduler reregisters and acknowledges the updates. Disadvantage 2: Updates could be sent to the scheduler out of order if some updates are dropped between the slave and master.

2) Send only the oldest status update to the master, but with an annotation of the final/terminal state of the task, if any. That way the master can call removeTask to update its internal state for the task (and update the UI) and recover the resources for the task. While the scheduler is still down, the oldest update will continue to be resent and forwarded, but the master will ignore the update (with a warning, as above) as far as its own internal state is concerned. When the scheduler reregisters, the update stream will be forwarded and acknowledged one at a time as before, guaranteeing status update ordering to the scheduler. Disadvantage 1: Seems a bit hacky to tack a terminal state onto a running update. Disadvantage 2: The state endpoint won't show all the status updates until the entire stream actually gets forwarded and acknowledged.

Thoughts?
On Wed, Sep 10, 2014 at 5:55 PM, Vinod Kone vinodk...@gmail.com wrote: [...]
Re: Review Request 25111: Added the concept of dynamically configurable slave attributes
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25111/ --- (Updated Sept. 11, 2014, 1:24 a.m.) Review request for mesos, Adam B and Benjamin Hindman. Changes --- Get the test closer to passing Bugs: MESOS-1739 https://issues.apache.org/jira/browse/MESOS-1739 Repository: mesos-git Description --- Add basic stub for dynamic slave attributes Diffs (updated) - src/Makefile.am 9b973e5 src/common/attributes.hpp 0a043d5 src/common/attributes.cpp aab114e src/common/slaveinfo_utils.hpp PRE-CREATION src/common/slaveinfo_utils.cpp PRE-CREATION src/master/master.hpp b492600 src/master/master.cpp d5db24e src/slave/slave.cpp 1b3dc73 src/tests/slave_tests.cpp 69be28f Diff: https://reviews.apache.org/r/25111/diff/ Testing --- This is currently a work in progress (WIP). Thanks, Patrick Reilly
Review Request 25525: MESOS-1739: Allow slave reconfiguration on restart
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25525/ --- Review request for mesos, Adam B, Benjamin Hindman, Patrick Reilly, and Vinod Kone. Bugs: MESOS-1739 https://issues.apache.org/jira/browse/MESOS-1739 Repository: mesos-git Description --- Allows attributes and resources to be set to a superset of what they were previously on a slave restart. Incorporates all comments from: https://issues.apache.org/jira/browse/MESOS-1739 and the former review request: https://reviews.apache.org/r/25111/ Diffs - src/Makefile.am 9b973e5 src/common/attributes.hpp 0a043d5 src/common/attributes.cpp aab114e src/common/slaveinfo_utils.hpp PRE-CREATION src/common/slaveinfo_utils.cpp PRE-CREATION src/master/master.hpp b492600 src/master/master.cpp d5db24e src/slave/slave.cpp 1b3dc73 src/tests/slave_tests.cpp 69be28f Diff: https://reviews.apache.org/r/25525/diff/ Testing --- make check on localhost Thanks, Cody Maloney
Build failed in Jenkins: mesos-reviewbot #1513
See https://builds.apache.org/job/mesos-reviewbot/1513/changes Changes: [adam] Fixed command executor path check [yujie.jay] Made sure IPv6 is disabled for port mapping network isolator. -- [...truncated 5561 lines...] rm -f slave/containerizer/.dirstamp rm -f slave/*.o rm -f slave/containerizer/isolators/cgroups/.deps/.dirstamp rm -f slave/containerizer/isolators/cgroups/.dirstamp rm -f slave/containerizer/isolators/network/.deps/.dirstamp rm -f slave/containerizer/isolators/network/.dirstamp rm -f slave/containerizer/mesos/.deps/.dirstamp rm -f slave/*.lo rm -f slave/containerizer/mesos/.dirstamp rm -f slave/containerizer/*.o rm -f state/.deps/.dirstamp rm -f state/.dirstamp rm -f tests/.deps/.dirstamp rm -f tests/.dirstamp rm -f tests/common/.deps/.dirstamp rm -f tests/common/.dirstamp rm -f slave/containerizer/*.lo rm -f usage/.deps/.dirstamp rm -f slave/containerizer/isolators/cgroups/*.o rm -f usage/.dirstamp rm -f zookeeper/.deps/.dirstamp rm -f zookeeper/.dirstamp rm -f slave/containerizer/isolators/cgroups/*.lo rm -f slave/containerizer/isolators/network/*.o rm -f slave/containerizer/isolators/network/*.lo rm -f slave/containerizer/mesos/*.o rm -f slave/containerizer/mesos/*.lo rm -f state/*.o rm -f state/*.lo rm -f tests/*.o rm -rf authorizer/.libs authorizer/_libs rm -rf common/.libs common/_libs rm -rf containerizer/.libs containerizer/_libs rm -rf docker/.libs docker/_libs rm -rf exec/.libs exec/_libs rm -rf files/.libs files/_libs rm -rf java/jni/.libs java/jni/_libs rm -rf jvm/.libs jvm/_libs rm -rf jvm/org/apache/.libs jvm/org/apache/_libs rm -rf linux/.libs linux/_libs rm -rf linux/routing/.libs linux/routing/_libs rm -rf linux/routing/filter/.libs linux/routing/filter/_libs rm -rf linux/routing/link/.libs linux/routing/link/_libs rm -rf linux/routing/queueing/.libs linux/routing/queueing/_libs rm -rf local/.libs local/_libs rm -rf log/.libs log/_libs rm -rf log/tool/.libs log/tool/_libs rm -rf logging/.libs logging/_libs rm -rf master/.libs 
master/_libs rm -rf messages/.libs messages/_libs rm -rf sasl/.libs sasl/_libs rm -rf sched/.libs sched/_libs rm -rf scheduler/.libs scheduler/_libs rm -rf slave/.libs slave/_libs rm -rf slave/containerizer/.libs slave/containerizer/_libs rm -rf slave/containerizer/isolators/cgroups/.libs slave/containerizer/isolators/cgroups/_libs rm -rf slave/containerizer/isolators/network/.libs slave/containerizer/isolators/network/_libs rm -rf slave/containerizer/mesos/.libs slave/containerizer/mesos/_libs rm -rf state/.libs state/_libs rm -rf usage/.libs usage/_libs rm -rf zookeeper/.libs zookeeper/_libs rm -f tests/common/*.o rm -f usage/*.o rm -f usage/*.lo rm -f zookeeper/*.o rm -f zookeeper/*.lo rm -rf ./.deps authorizer/.deps cli/.deps common/.deps containerizer/.deps docker/.deps examples/.deps exec/.deps files/.deps health-check/.deps java/jni/.deps jvm/.deps jvm/org/apache/.deps launcher/.deps linux/.deps linux/routing/.deps linux/routing/filter/.deps linux/routing/link/.deps linux/routing/queueing/.deps local/.deps log/.deps log/tool/.deps logging/.deps master/.deps messages/.deps sasl/.deps sched/.deps scheduler/.deps slave/.deps slave/containerizer/.deps slave/containerizer/isolators/cgroups/.deps slave/containerizer/isolators/network/.deps slave/containerizer/mesos/.deps state/.deps tests/.deps tests/common/.deps usage/.deps zookeeper/.deps rm -f Makefile make[2]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build/src' Making distclean in ec2 make[2]: Entering directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build/ec2' rm -rf .libs _libs rm -f *.lo test -z || rm -f test . 
= ../../ec2 || test -z || rm -f rm -f Makefile make[2]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build/ec2' rm -f config.status config.cache config.log configure.lineno config.status.lineno rm -f Makefile make[1]: Leaving directory `https://builds.apache.org/job/mesos-reviewbot/ws/mesos-0.21.0/_build' if test -d mesos-0.21.0; then find mesos-0.21.0 -type d ! -perm -200 -exec chmod u+w {} ';' rm -rf mesos-0.21.0 || { sleep 5 rm -rf mesos-0.21.0; }; else :; fi == mesos-0.21.0 archives ready for distribution: mesos-0.21.0.tar.gz == real71m40.401s user143m18.903s sys 7m50.486s + chmod -R +w 3rdparty CHANGELOG Doxyfile LICENSE Makefile Makefile.am Makefile.in NOTICE README.md aclocal.m4 ar-lib autom4te.cache bin bootstrap compile config.guess config.log config.lt config.status config.sub configure configure.ac depcomp docs ec2 frameworks include install-sh libtool ltmain.sh m4 mesos-0.21.0.tar.gz mesos.pc mesos.pc.in missing mpi src support + git clean -fdx Removing .libs/ Removing 3rdparty/Makefile Removing 3rdparty/Makefile.in Removing 3rdparty/libprocess/.deps/ Removing 3rdparty/libprocess/3rdparty/.deps/ Removing
Review Request 25526: catch trailing spaces in style checker
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25526/ --- Review request for mesos, Benjamin Hindman and Vinod Kone. Bugs: MESOS-1779 https://issues.apache.org/jira/browse/MESOS-1779 Repository: mesos-git Description --- fixes MESOS-1779 Diffs - support/mesos-style.py d24cb11adc06bc0ebaaa206301616c8b597f09e8 Diff: https://reviews.apache.org/r/25526/diff/ Testing --- Thanks, Kamil Domanski
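The rule being enabled corresponds to cpplint's end-of-line whitespace check. A standalone sketch of what such a trailing-space detector does (the names are illustrative, not the actual mesos-style.py code):

```python
import re

TRAILING_WS = re.compile(r'[ \t]+$')

def find_trailing_whitespace(source):
    """Return the 1-based line numbers of lines that end in spaces
    or tabs, which a style checker would flag."""
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if TRAILING_WS.search(line)]
```

Running it over a small snippet flags only the offending lines, e.g. `find_trailing_whitespace("ok\nbad \n")` reports line 2.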
Jenkins build is back to normal : mesos-reviewbot #1514
See https://builds.apache.org/job/mesos-reviewbot/1514/changes
Re: Review Request 25487: Increased session timeouts for ZooKeeper related tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25487/#review52993 --- Patch looks great! Reviews applied: [25487] All tests passed. - Mesos ReviewBot On Sept. 10, 2014, 6 p.m., Jiang Yan Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25487/ --- (Updated Sept. 10, 2014, 6 p.m.) Review request for mesos and Ben Mahler. Bugs: MESOS-1676 https://issues.apache.org/jira/browse/MESOS-1676 Repository: mesos-git Description --- - On slower machines, the ZooKeeper C client sometimes times out when we aren't expecting it to, because either the test server or the client is too slow to respond. Increasing this value helps mitigate the problem. - The effect of server->shutdownNetwork() is immediate, so this won't prolong the tests as long as they don't wait for session expiration without clock advances, which I have checked and there are none. Diffs - src/tests/master_contender_detector_tests.cpp 9ac59aa446a132e734238e0e55801117c4ef31b4 src/tests/zookeeper.cpp e45f956e1486e952a4efeb123e15568518fb53fe Diff: https://reviews.apache.org/r/25487/diff/ Testing --- make check. Thanks, Jiang Yan Xu
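The rationale behind raising the timeout: a ZooKeeper session expires only if no heartbeat arrives within the session timeout, so a larger timeout gives a slow test machine more headroom before a spurious expiration. A toy model with a fake clock (all names hypothetical; this is not the ZooKeeper client API):

```python
class FakeClock:
    """Manually advanced clock, as used in deterministic tests."""
    def __init__(self):
        self.now = 0.0

    def advance(self, secs):
        self.now += secs


class Session:
    """Toy session: expires when no heartbeat arrives within `timeout`."""
    def __init__(self, clock, timeout):
        self.clock = clock
        self.timeout = timeout
        self.last_heartbeat = clock.now

    def heartbeat(self):
        self.last_heartbeat = self.clock.now

    def expired(self):
        return self.clock.now - self.last_heartbeat > self.timeout
```

With a 10-second timeout, a client that takes 11 seconds to respond expires; the same delay under a 30-second timeout does not, which is the mitigation the patch applies.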
Re: Review Request 25526: catch trailing spaces in style checker
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25526/#review52994 --- Ship it! Thank you sir! Didn't realize this rule already existed in cpplint. - Vinod Kone On Sept. 11, 2014, 3:36 a.m., Kamil Domanski wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25526/ --- (Updated Sept. 11, 2014, 3:36 a.m.) Review request for mesos, Benjamin Hindman and Vinod Kone. Bugs: MESOS-1779 https://issues.apache.org/jira/browse/MESOS-1779 Repository: mesos-git Description --- fixes MESOS-1779 Diffs - support/mesos-style.py d24cb11adc06bc0ebaaa206301616c8b597f09e8 Diff: https://reviews.apache.org/r/25526/diff/ Testing --- Thanks, Kamil Domanski
Re: Review Request 25487: Increased session timeouts for ZooKeeper related tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25487/#review52998 --- Ship it! Ship It! - Ben Mahler On Sept. 10, 2014, 6 p.m., Jiang Yan Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25487/ --- (Updated Sept. 10, 2014, 6 p.m.) Review request for mesos and Ben Mahler. Bugs: MESOS-1676 https://issues.apache.org/jira/browse/MESOS-1676 Repository: mesos-git Description --- - On slower machines, the ZooKeeper C client sometimes times out when we aren't expecting it to, because either the test server or the client is too slow to respond. Increasing this value helps mitigate the problem. - The effect of server->shutdownNetwork() is immediate, so this won't prolong the tests as long as they don't wait for session expiration without clock advances, which I have checked and there are none. Diffs - src/tests/master_contender_detector_tests.cpp 9ac59aa446a132e734238e0e55801117c4ef31b4 src/tests/zookeeper.cpp e45f956e1486e952a4efeb123e15568518fb53fe Diff: https://reviews.apache.org/r/25487/diff/ Testing --- make check. Thanks, Jiang Yan Xu
Re: Review Request 25511: Pulled the log line in ZooKeeperTestServer::shutdownNetwork() to above the shutdown call.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25511/#review52999 --- Ship it! Ship It! - Ben Mahler On Sept. 10, 2014, 6:02 p.m., Jiang Yan Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25511/ --- (Updated Sept. 10, 2014, 6:02 p.m.) Review request for mesos and Ben Mahler. Repository: mesos-git Description --- - When debugging ZooKeeper-related tests, it's often useful to know when the test is about to shut down the ZK server, in order to reason about the order of events. Otherwise, client disconnections are often logged before this shutdown line, which can be confusing. Diffs - src/tests/zookeeper_test_server.cpp a8c9b1cd8a546abdeb4d89a8fe9ebc3b3d577665 Diff: https://reviews.apache.org/r/25511/diff/ Testing --- make check. Thanks, Jiang Yan Xu
Re: Review Request 25035: Fix for MESOS-1688
On Sept. 9, 2014, 7:10 p.m., Vinod Kone wrote: src/master/master.cpp, line 1901 https://reviews.apache.org/r/25035/diff/4/?file=682182#file682182line1901 I like these warnings. Are you planning to get this into 0.20.1 or 0.21.0? If the former, can you add this to the list of deprecations in the CHANGELOG? Martin Weindel wrote: Would be nice to see this in 0.20.1. But it is not clear to me how to update the CHANGELOG. There is no section for upcoming releases. Just start one for 0.20.1 and add the deprecation. See how we did it for 0.20.0 and 0.19.1 for inspiration. As we get close to releasing 0.20.1, the release manager will make sure to update the CHANGELOG with the tickets and other info. - Vinod --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/#review52763 --- On Sept. 10, 2014, 10 p.m., Martin Weindel wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/ --- (Updated Sept. 10, 2014, 10 p.m.) Review request for mesos and Vinod Kone. Bugs: MESOS-1688 https://issues.apache.org/jira/browse/MESOS-1688 Repository: mesos-git Description --- As already explained in JIRA MESOS-1688, there are schedulers that allocate memory only for the executor and not for tasks; for tasks, only CPU resources are allocated in this case. Such a scheduler does not get offered any idle CPUs if the slave has nearly used up all its memory. This can easily lead to a deadlock (in the application, not in Mesos). Simple example: 1. Scheduler allocates all memory of a slave for an executor 2. Scheduler launches a task for this executor (allocating 1 CPU) 3. Task finishes: 1 CPU, 0 MB memory allocatable. 4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.
To fix this problem, offers must be made if CPU resources are allocatable, without considering allocatable memory. Diffs - src/common/resources.cpp edf36b1 src/master/constants.cpp faa1503 src/master/hierarchical_allocator_process.hpp 34f8cd6 src/master/master.cpp 18464ba src/tests/allocator_tests.cpp 774528a Diff: https://reviews.apache.org/r/25035/diff/ Testing --- Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in fine-grained mode to saturate allocatable memory. The jobs run fine now. This load always caused a deadlock in all Spark jobs within one minute with unpatched Mesos. Thanks, Martin Weindel
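The fix described amounts to treating each resource dimension independently when deciding whether a slave's remaining resources are worth offering. A hedged sketch of the before/after decision, with illustrative thresholds rather than Mesos's actual allocator constants:

```python
MIN_CPUS = 0.01   # illustrative minimum, not Mesos's actual constant
MIN_MEM_MB = 32   # illustrative minimum

def allocatable_old(cpus, mem_mb):
    # Sketch of the old behavior: a slave's leftovers were only offered
    # when memory cleared its minimum, so free CPUs were withheld
    # whenever memory was exhausted (the reported deadlock).
    return cpus >= MIN_CPUS and mem_mb >= MIN_MEM_MB

def allocatable_fixed(cpus, mem_mb):
    # Sketch of the fix: offer if ANY resource clears its own minimum,
    # so idle CPUs are offered even with 0 MB of allocatable memory.
    return cpus >= MIN_CPUS or mem_mb >= MIN_MEM_MB
```

In the example from the description (1 CPU, 0 MB free after the task finishes), the old predicate suppresses the offer while the fixed one makes it, letting the scheduler launch its next CPU-only task.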
Re: Review Request 25511: Pulled the log line in ZooKeeperTestServer::shutdownNetwork() to above the shutdown call.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25511/#review53003 --- Patch looks great! Reviews applied: [25511] All tests passed. - Mesos ReviewBot On Sept. 10, 2014, 6:02 p.m., Jiang Yan Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25511/ --- (Updated Sept. 10, 2014, 6:02 p.m.) Review request for mesos and Ben Mahler. Repository: mesos-git Description --- - When debugging ZooKeeper-related tests, it's often useful to know when the test is about to shut down the ZK server, in order to reason about the order of events. Otherwise, client disconnections are often logged before this shutdown line, which can be confusing. Diffs - src/tests/zookeeper_test_server.cpp a8c9b1cd8a546abdeb4d89a8fe9ebc3b3d577665 Diff: https://reviews.apache.org/r/25511/diff/ Testing --- make check. Thanks, Jiang Yan Xu
Re: Review Request 25035: Fix for MESOS-1688
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/#review53002 --- src/common/resources.cpp https://reviews.apache.org/r/25035/#comment92333 I'm not sure what's happening here. Can you add a comment? src/master/master.cpp https://reviews.apache.org/r/25035/#comment92334 Add a TODO: TODO(martin): Return Error instead of logging a warning in 0.21.0. src/tests/allocator_tests.cpp https://reviews.apache.org/r/25035/#comment92336 s/with cpus only/using only cpus/ src/tests/allocator_tests.cpp https://reviews.apache.org/r/25035/#comment92335 s/tasks/task/ src/tests/allocator_tests.cpp https://reviews.apache.org/r/25035/#comment92337 s/with memory only/using only memory/ src/tests/allocator_tests.cpp https://reviews.apache.org/r/25035/#comment92338 s/mem/memory/ src/tests/allocator_tests.cpp https://reviews.apache.org/r/25035/#comment92339 s/tasks/task/ - Vinod Kone On Sept. 10, 2014, 10 p.m., Martin Weindel wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/ --- (Updated Sept. 10, 2014, 10 p.m.) Review request for mesos and Vinod Kone. Bugs: MESOS-1688 https://issues.apache.org/jira/browse/MESOS-1688 Repository: mesos-git Description --- As already explained in JIRA MESOS-1688, there are schedulers that allocate memory only for the executor and not for tasks; for tasks, only CPU resources are allocated in this case. Such a scheduler does not get offered any idle CPUs if the slave has nearly used up all its memory. This can easily lead to a deadlock (in the application, not in Mesos). Simple example: 1. Scheduler allocates all memory of a slave for an executor 2. Scheduler launches a task for this executor (allocating 1 CPU) 3. Task finishes: 1 CPU, 0 MB memory allocatable. 4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application.
To fix this problem, offers must be made if CPU resources are allocatable, without considering allocatable memory. Diffs - src/common/resources.cpp edf36b1 src/master/constants.cpp faa1503 src/master/hierarchical_allocator_process.hpp 34f8cd6 src/master/master.cpp 18464ba src/tests/allocator_tests.cpp 774528a Diff: https://reviews.apache.org/r/25035/diff/ Testing --- Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in fine-grained mode to saturate allocatable memory. The jobs run fine now. This load always caused a deadlock in all Spark jobs within one minute with unpatched Mesos. Thanks, Martin Weindel
Re: Review Request 25035: Fix for MESOS-1688
On Sept. 11, 2014, 5:35 a.m., Vinod Kone wrote: Can you also update the summary of the review to something more meaningful? We typically use the summary to generate the commit message. - Vinod --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/#review53002 --- On Sept. 10, 2014, 10 p.m., Martin Weindel wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25035/ --- (Updated Sept. 10, 2014, 10 p.m.) Review request for mesos and Vinod Kone. Bugs: MESOS-1688 https://issues.apache.org/jira/browse/MESOS-1688 Repository: mesos-git Description --- As already explained in JIRA MESOS-1688, there are schedulers that allocate memory only for the executor and not for tasks; for tasks, only CPU resources are allocated in this case. Such a scheduler does not get offered any idle CPUs if the slave has nearly used up all its memory. This can easily lead to a deadlock (in the application, not in Mesos). Simple example: 1. Scheduler allocates all memory of a slave for an executor 2. Scheduler launches a task for this executor (allocating 1 CPU) 3. Task finishes: 1 CPU, 0 MB memory allocatable. 4. No offers are made, as no memory is left. The scheduler will wait for offers forever: a deadlock in the application. To fix this problem, offers must be made if CPU resources are allocatable, without considering allocatable memory. Diffs - src/common/resources.cpp edf36b1 src/master/constants.cpp faa1503 src/master/hierarchical_allocator_process.hpp 34f8cd6 src/master/master.cpp 18464ba src/tests/allocator_tests.cpp 774528a Diff: https://reviews.apache.org/r/25035/diff/ Testing --- Deployed patched Mesos 0.19.1 on a small cluster with 3 slaves and tested running multiple parallel Spark jobs in fine-grained mode to saturate allocatable memory. The jobs run fine now. This load always caused a deadlock in all Spark jobs within one minute with unpatched Mesos. Thanks, Martin Weindel