[jira] [Commented] (MESOS-3586) Installing Mesos 0.24.0 on multiple systems. Failed test on MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034263#comment-15034263 ]

Jan Schlicht commented on MESOS-3586:

I have to reopen this, as I've found the same behavior using 0.26.0-rc2 on CentOS 7.1. I noticed some flakiness while running {{sudo ./bin/mesos-tests.sh}} and could reproduce it by running {{sudo ./bin/mesos-tests.sh --gtest_filter="MemoryPressureMesosTest.CGROUPS_ROOT_Statistics" --gtest_repeat=-1 --gtest_break_on_failure}} until it breaks.

Here's the verbose output of a failing test:

{noformat}
[ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
I1201 18:07:51.136508 18883 cgroups.cpp:2429] Freezing cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb/d540e60d-2d62-4a1e-b5ff-482f7b3cc1a5
I1201 18:07:51.144594 18886 cgroups.cpp:1411] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb/d540e60d-2d62-4a1e-b5ff-482f7b3cc1a5 after 7.076864ms
I1201 18:07:51.151480 18882 cgroups.cpp:2447] Thawing cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb/d540e60d-2d62-4a1e-b5ff-482f7b3cc1a5
I1201 18:07:51.162557 18886 cgroups.cpp:1440] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb/d540e60d-2d62-4a1e-b5ff-482f7b3cc1a5 after 11.026944ms
I1201 18:07:51.172379 18887 cgroups.cpp:2429] Freezing cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb
I1201 18:07:51.183791 18881 cgroups.cpp:1411] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb after 7.8272ms
I1201 18:07:51.192354 18887 cgroups.cpp:2447] Thawing cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb
I1201 18:07:51.199439 18885 cgroups.cpp:1440] Successfullly thawed cgroup /sys/fs/cgroup/freezer/mesos_test_7bcd6aa5-6f35-44ea-90a5-e7f047edbffb after 7.028224ms
I1201 18:07:51.332849 18866 leveldb.cpp:176] Opened db in 6.74674ms
I1201 18:07:51.335450 18866 leveldb.cpp:183] Compacted db in 2.554513ms
I1201 18:07:51.335539 18866 leveldb.cpp:198] Created db iterator in 53851ns
I1201 18:07:51.335556 18866 leveldb.cpp:204] Seeked to beginning of db in 3455ns
I1201 18:07:51.335561 18866 leveldb.cpp:273] Iterated through 0 keys in the db in 107ns
I1201 18:07:51.335666 18866 replica.cpp:780] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1201 18:07:51.337374 18881 recover.cpp:449] Starting replica recovery
I1201 18:07:51.338235 18881 recover.cpp:475] Replica is in EMPTY status
I1201 18:07:51.340142 18880 replica.cpp:676] Replica in EMPTY status received a broadcasted recover request from (14)@127.0.0.1:57652
I1201 18:07:51.340749 18882 recover.cpp:195] Received a recover response from a replica in EMPTY status
I1201 18:07:51.340975 18885 master.cpp:367] Master 2f17d97c-de40-491e-9706-bf83a9ffd08c (centos71) started on 127.0.0.1:57652
I1201 18:07:51.341475 18884 recover.cpp:566] Updating replica status to STARTING
I1201 18:07:51.341152 18885 master.cpp:369] Flags at startup: --acls="" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="true" --authenticate_slaves="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/ap4rPt/credentials" --framework_sorter="drf" --help="false" --hostname_lookup="true" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --quiet="false" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="25secs" --registry_strict="true" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ap4rPt/master" --zk_session_timeout="10secs"
W1201 18:07:51.341752 18885 master.cpp:372] ** Master bound to loopback interface! Cannot communicate with remote schedulers or slaves. You might want to set '--ip' flag to a routable IP address. **
I1201 18:07:51.341794 18885 master.cpp:414] Master only allowing authenticated frameworks to register
I1201 18:07:51.341804 18885 master.cpp:419] Master only allowing authenticated slaves to register
I1201 18:07:51.341879 18885 credentials.hpp:37] Loading credentials for authentication from '/tmp/ap4rPt/credentials'
I1201 18:07:51.345211 18885 master.cpp:458] Using default 'crammd5' authenticator
I1201 18:07:51.345268 18882 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 3.5302ms
I1201 18:07:51.345289 18882 replica.cpp:323] Persisted replica status to STARTING
I1201 18:07:51.345350 18885 authenticator.cpp:520] Initializing server SASL
I1201
[jira] [Commented] (MESOS-3586) Installing Mesos 0.24.0 on multiple systems. Failed test on MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034269#comment-15034269 ]

Jan Schlicht commented on MESOS-3586:

I used the following Vagrant generator to set up a CentOS virtual environment:

{noformat}
cat << EOF > Vagrantfile
# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure(2) do |config|
  # Disable shared folder to prevent certain kernel module dependencies.
  config.vm.synced_folder ".", "/vagrant", disabled: true

  config.vm.hostname = "centos71"

  config.vm.box = "bento/centos-7.1"

  config.vm.provider "virtualbox" do |vb|
    vb.memory = 8192
    vb.cpus = 8
  end

  config.vm.provision "shell", inline: <<-SHELL
    yum -y update systemd

    yum install -y tar wget
    wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo

    yum groupinstall -y "Development Tools"
    yum install -y apache-maven python-devel java-1.7.0-openjdk-devel zlib-devel libcurl-devel openssl-devel cyrus-sasl-devel cyrus-sasl-md5 apr-devel subversion-devel apr-util-devel

    yum install -y libevent-devel

    yum install -y perf nmap-ncat

    yum install -y git

    yum install -y docker
    systemctl start docker
    systemctl enable docker
    docker info

    #wget -qO- https://get.docker.com/ | sh
  SHELL
end
EOF

vagrant up
vagrant reload

vagrant ssh -c "
git clone https://github.com/apache/mesos.git mesos
cd mesos
git checkout -b 0.26.0-rc2 0.26.0-rc2

./bootstrap
mkdir build
cd build

../configure
GTEST_FILTER="" make check
sudo ./bin/mesos-tests.sh
"
{noformat}

> Installing Mesos 0.24.0 on multiple systems. Failed test on MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> ---
>
> Key: MESOS-3586
> URL: https://issues.apache.org/jira/browse/MESOS-3586
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.24.0
> Environment: Ubuntu 14.04, 3.13.0-32 generic
> Reporter: Miguel Bernadin
>
> I am installing Mesos 0.24.0 on 4 servers which have very similar hardware and software configurations.
> After performing ../configure, make, and make check, some servers completed successfully and others failed on test [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics.
> Is there something I should check in this test?
>
> PERFORMED MAKE CHECK NODE-001
> [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:37:35.585067 38479 exec.cpp:133] Version: 0.24.0
> I1005 14:37:35.593789 38497 exec.cpp:207] Executor registered on slave 20151005-143735-2393768202-35106-27900-S0
> Registered executor on svdidac038.techlabs.accenture.com
> Starting task 010b2fe9-4eac-4136-8a8a-6ce7665488b0
> Forked command at 38510
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
>
> PERFORMED MAKE CHECK NODE-002
> [ RUN ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
> I1005 14:38:58.794112 36997 exec.cpp:133] Version: 0.24.0
> I1005 14:38:58.802851 37022 exec.cpp:207] Executor registered on slave 20151005-143857-2360213770-50427-26325-S0
> Registered executor on svdidac039.techlabs.accenture.com
> Starting task 9bb317ba-41cb-44a4-b507-d1c85ceabc28
> sh -c 'while true; do dd count=512 bs=1M if=/dev/zero of=./temp; done'
> Forked command at 37028
> ../../src/tests/containerizer/memory_pressure_tests.cpp:145: Failure
> Expected: (usage.get().mem_medium_pressure_counter()) >= (usage.get().mem_critical_pressure_counter()), actual: 5 vs 6
> 2015-10-05 14:39:00,130:26325(0x2af08cc78700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [127.0.0.1:37198] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics (4303 ms)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3586) Installing Mesos 0.24.0 on multiple systems. Failed test on MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034339#comment-15034339 ]

Jan Schlicht commented on MESOS-3586:

It seems like a timing problem in the test. It's making the assumption that {{os::sleep}} will sleep for the exact amount that it's provided with.
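The timing concern in Jan Schlicht's comment (a test relying on {{os::sleep}} sleeping for exactly the requested duration) can be illustrated with standard C++ alone. The sketch below is not Mesos code and not necessarily the fix adopted for MESOS-3586: {{measuredSleep}} and {{waitUntil}} are hypothetical helper names, and {{std::this_thread::sleep_for}} stands in for {{os::sleep}}. It shows that a sleep only guarantees a lower bound on the elapsed time, and that a test can poll for a condition with a deadline instead of assuming a fixed delay.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of Mesos or stout): returns how long a
// requested sleep actually took. The result is at least `requested`, but
// the scheduler may make it arbitrarily longer on a loaded machine.
inline std::chrono::nanoseconds measuredSleep(
    std::chrono::milliseconds requested)
{
  const auto start = std::chrono::steady_clock::now();
  std::this_thread::sleep_for(requested);
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
      std::chrono::steady_clock::now() - start);
}

// Hypothetical helper: polls `condition` every `interval` until it returns
// true or `timeout` has elapsed. Returns true if the condition was observed
// to hold, false if the deadline passed first.
inline bool waitUntil(
    const std::function<bool()>& condition,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(10))
{
  assert(interval.count() > 0); // A zero interval would busy-loop.

  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!condition()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;
    }
    std::this_thread::sleep_for(interval);
  }
  return true;
}
```

A test written this way would call something like {{waitUntil(conditionOfInterest, timeout)}} and assert on the result, so that an overshooting sleep slows the test down rather than making it fail.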
[jira] [Commented] (MESOS-3586) Installing Mesos 0.24.0 on multiple systems. Failed test on MemoryPressureMesosTest.CGROUPS_ROOT_Statistics
[ https://issues.apache.org/jira/browse/MESOS-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14944276#comment-14944276 ]

Marco Massenzio commented on MESOS-3586:

We would not recommend installing Mesos on a cluster by building it on *every* node: you would probably be better off using the pre-packaged binaries, or building it on a build machine and then distributing your binaries with whatever deployment manager you prefer.

Having said that, I'm guessing you were running {{make check}} as {{root}} on your system? (The {{*ROOT}} tests are only run when the user is the superuser on a system.) Can you please provide more details on the OS/distribution/environment for the failure?

Finally, it would seem that the actual error has to do with connecting to ZooKeeper (the tests launch a local instance of ZK and then try to connect to it; if that fails for whatever reason, the tests will fail too).
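Marco Massenzio's remark that the {{*ROOT}} tests only run for the superuser can be pictured with a small filter. This is a simplified sketch under assumptions, not the Mesos test harness itself: {{requiresRoot}}, {{shouldRun}}, and {{runningAsSuperuser}} are hypothetical names, and matching on the substring {{ROOT_}} in the test name is an assumption based on the naming of {{CGROUPS_ROOT_Statistics}}.

```cpp
#include <cassert>
#include <string>

#include <unistd.h>

// Hypothetical helper: true when the test name carries the ROOT_ marker,
// as in "MemoryPressureMesosTest.CGROUPS_ROOT_Statistics".
inline bool requiresRoot(const std::string& testName)
{
  assert(!testName.empty()); // An empty name indicates a caller bug.
  return testName.find("ROOT_") != std::string::npos;
}

// Hypothetical helper: a test is run if it needs no special privileges,
// or if the current user is the superuser.
inline bool shouldRun(const std::string& testName, bool isSuperuser)
{
  return isSuperuser || !requiresRoot(testName);
}

// True when the effective user ID of the process is 0 (POSIX geteuid).
inline bool runningAsSuperuser()
{
  return ::geteuid() == 0;
}
```

Under this sketch, {{shouldRun("MemoryPressureMesosTest.CGROUPS_ROOT_Statistics", runningAsSuperuser())}} is false for an ordinary user, which matches the test only showing up when {{make check}} or {{mesos-tests.sh}} is run as root or via {{sudo}}.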