[jira] [Created] (MESOS-8353) Duplicate task for same framework on multiple agents crashes out master after failover
Zhitao Li created MESOS-8353:
Summary: Duplicate task for same framework on multiple agents crashes out master after failover
Key: MESOS-8353
URL: https://issues.apache.org/jira/browse/MESOS-8353
Project: Mesos
Issue Type: Bug
Reporter: Zhitao Li

We have seen a Mesos master crash loop after a leader failover. After more investigation, it appears the same task ID managed to be created on multiple Mesos agents in the cluster. One possible sequence of events leading to this problem:

1. Task T1 was launched through master M1 on agent A1 for framework F;
2. Master M1 failed over to M2;
3. Before A1 reregistered with M2, the same T1 was launched onto agent A2: M2 did not know about the previous T1 yet, so it accepted the task and sent it to A2;
4. A1 reregistered: this probably crashed M2 (because the same task cannot be added twice);
5. When M3 tried to come up after M2, it crashed again because both A1 and A2 tried to add a T1 to the framework. (I only have logs to prove the last step right now.)

This happened on 1.4.0 masters. Although this was probably triggered by incorrect retry logic on the framework side, I wonder whether the Mesos master should add extra protection to prevent such an issue. One possible idea is to instruct one of the agents carrying tasks with a duplicate ID to terminate the corresponding tasks, or to refuse to reregister such agents and instruct them to shut down.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
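The protection idea above can be sketched as a master-side index that detects when two agents report the same task ID for one framework. This is a hypothetical sketch with illustrative names (TaskIndex, addTask), not actual Mesos code: instead of CHECK-failing when a reregistering agent reports a task ID already tracked on another agent, the conflict is surfaced so the master could, e.g., instruct one agent to terminate its copy.

```cpp
#include <map>
#include <string>

// Hypothetical duplicate-task guard (illustrative, not the Mesos master's
// actual bookkeeping).
class TaskIndex {
public:
  // Returns true if the task is new or re-reported by the same agent;
  // false signals a duplicate task ID on a different agent, which the
  // caller should resolve instead of crashing.
  bool addTask(const std::string& frameworkId,
               const std::string& taskId,
               const std::string& agentId) {
    auto& byTask = tasks_[frameworkId];
    auto inserted = byTask.emplace(taskId, agentId);
    return inserted.second || inserted.first->second == agentId;
  }

private:
  // frameworkId -> (taskId -> agentId that first reported it).
  std::map<std::string, std::map<std::string, std::string>> tasks_;
};
```

In the failover scenario above, A1's reregistration of T1 after A2 already reported it would return false here, letting the master react gracefully rather than aborting.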
[jira] [Created] (MESOS-8352) Resources may get over allocated to some roles while fail to meet the quota of other roles.
Meng Zhu created MESOS-8352:
Summary: Resources may get over allocated to some roles while fail to meet the quota of other roles.
Key: MESOS-8352
URL: https://issues.apache.org/jira/browse/MESOS-8352
Project: Mesos
Issue Type: Bug
Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu

In the quota role allocation stage, if a role gets some resources on an agent to meet its quota, it will also get all other resources on the same agent that it does not have quota for. This may starve roles behind it that have quotas set for those resources.

To fix that, we need to track quota headroom in the quota role allocation stage. In that stage, if a role has no quota set for a scalar resource, it will get that resource only when two conditions are both met:

- It got some other resources on the same agent to meet its quota; and
- After allocating those resources, quota headroom is still above the required amount.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
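The two conditions above can be sketched as a single predicate over a simplified scalar model. All names and the single-resource simplification are illustrative; the actual allocator tracks headroom per resource kind across the cluster.

```cpp
// Simplified scalar model of the proposed rule (illustrative, not the actual
// hierarchical allocator code): a role without quota for a resource may take
// `amount` of it on an agent only if (1) it received quota'd resources on
// that agent, and (2) handing it out still leaves enough headroom to satisfy
// the remaining unmet quota of other roles.
bool mayAllocateNonQuotaResource(
    double amount,            // amount of the non-quota resource on the agent
    bool metQuotaOnAgent,     // condition 1: role got quota'd resources here
    double headroom,          // unallocated amount of this resource cluster-wide
    double requiredHeadroom)  // sum of other roles' unmet quota for it
{
  return metQuotaOnAgent && (headroom - amount) >= requiredHeadroom;
}
```

For example, with 10 units free cluster-wide and 5 units of unmet quota elsewhere, a role meeting condition 1 could take 2 extra units (headroom stays at 8) but not 6 (headroom would drop to 4, starving the quota'd roles).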
[jira] [Commented] (MESOS-8224) mesos.interface 1.4.0 cannot be installed with pip
[ https://issues.apache.org/jira/browse/MESOS-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299469#comment-16299469 ]

Benjamin Mahler commented on MESOS-8224:

[~karya] can you take a look at the 1.4.0 egg issue?

> mesos.interface 1.4.0 cannot be installed with pip
> --
>
> Key: MESOS-8224
> URL: https://issues.apache.org/jira/browse/MESOS-8224
> Project: Mesos
> Issue Type: Task
> Components: release
> Reporter: Bill Farner
>
> This breaks some framework development tooling.
> With latest pip:
> {noformat}
> $ python -m pip -V
> pip 9.0.1 from /Users/wfarner/code/aurora/build-support/python/pycharm.venv/lib/python2.7/site-packages (python 2.7)
> {noformat}
> This works fine for previous releases:
> {noformat}
> $ python -m pip install mesos.interface==1.3.0
> Collecting mesos.interface==1.3.0
> ...
> Installing collected packages: mesos.interface
> Successfully installed mesos.interface-1.3.0
> {noformat}
> But it does not for 1.4.0:
> {noformat}
> $ python -m pip install mesos.interface==1.4.0
> Collecting mesos.interface==1.4.0
> Could not find a version that satisfies the requirement mesos.interface==1.4.0 (from versions: 0.21.2.linux-x86_64, 0.22.1.2.linux-x86_64, 0.22.2.linux-x86_64, 0.23.1.linux-x86_64, 0.24.1.linux-x86_64, 0.24.2.linux-x86_64, 0.25.0.linux-x86_64, 0.25.1.linux-x86_64, 0.26.1.linux-x86_64, 0.27.0.linux-x86_64, 0.27.1.linux-x86_64, 0.27.2.linux-x86_64, 0.28.0.linux-x86_64, 0.28.1.linux-x86_64, 0.28.2.linux-x86_64, 1.0.0.linux-x86_64, 1.0.1.linux-x86_64, 1.1.0.linux-x86_64, 1.2.0.linux-x86_64, 1.3.0.linux-x86_64, 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1.2, 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.26.0, 0.26.1, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1, 0.28.2, 1.0.0, 1.0.1, 1.1.0, 1.2.0, 1.3.0)
> No matching distribution found for mesos.interface==1.4.0
> {noformat}
> Verbose output shows that pip skips the 1.4.0 distribution:
> {noformat}
> $ python -m pip install -v mesos.interface==1.4.0 | grep 1.4.0
> Collecting mesos.interface==1.4.0
> Skipping link https://pypi.python.org/packages/ef/1b/d5b0c1456f755ad42477eaa9667e22d1f5fd8e2fce0f9b26937f93743f6c/mesos.interface-1.4.0-py2.7.egg#md5=32113860961d49c31f69f7b13a9bc063 (from https://pypi.python.org/simple/mesos-interface/); unsupported archive format: .egg
> {noformat}

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8351) Improve mesos-style.py linters instantiation
Armand Grillet created MESOS-8351:
Summary: Improve mesos-style.py linters instantiation
Key: MESOS-8351
URL: https://issues.apache.org/jira/browse/MESOS-8351
Project: Mesos
Issue Type: Improvement
Reporter: Armand Grillet
Assignee: Armand Grillet

Currently, all linters are instantiated when running mesos-style.py without first checking whether they will be needed. In the main function of mesos-style.py, a check should determine which linters to instantiate and use depending on the input given (the list of files that should be linted).

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8350) Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration
[ https://issues.apache.org/jira/browse/MESOS-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier updated MESOS-8350:

Description:
For resource provider-capable agents the master does not re-send checkpointed resources on agent reregistration; instead, the checkpointed resources sent as part of the {{ReregisterSlaveMessage}} should be used.

This is not what happens in reality. If, e.g., checkpointing of an offer operation fails and the agent fails over, the checkpointed resources would, as expected, not be reflected in the agent, but would still be assumed in the master.

A workaround is to fail over the master, which would lead to the newly elected master bootstrapping agent state from {{ReregisterSlaveMessage}}.

was:
For resource provider-capable agents the master does not re-send checkpointed resources on agent reregistration; instead, the checkpointed resources sent as part of the {{ReregisterSlaveMessage}} should be used.

This is not what happens in reality. If, e.g., checkpointing of an offer operation fails and the agent fails over, the checkpointed resources would, as expected, not be reflected in the agent, but would still be assumed in the master.

> Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration
> --
>
> Key: MESOS-8350
> URL: https://issues.apache.org/jira/browse/MESOS-8350
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Priority: Critical
>
> For resource provider-capable agents the master does not re-send checkpointed resources on agent reregistration; instead, the checkpointed resources sent as part of the {{ReregisterSlaveMessage}} should be used.
> This is not what happens in reality. If, e.g., checkpointing of an offer operation fails and the agent fails over, the checkpointed resources would, as expected, not be reflected in the agent, but would still be assumed in the master.
> A workaround is to fail over the master, which would lead to the newly elected master bootstrapping agent state from {{ReregisterSlaveMessage}}.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-6616) Error: dereferencing type-punned pointer will break strict-aliasing rules.
[ https://issues.apache.org/jira/browse/MESOS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295472#comment-16295472 ] Alexander Rukletsov edited comment on MESOS-6616 at 12/20/17 7:32 PM: -- I'm able to reproduce it on Debian 8.10 + gcc 5.5.0-6 with {{O2}} enabled. {noformat} admin@ip-10-178-168-202:~/mesos/build$ gcc --version gcc (Debian 5.5.0-6) 5.5.0 20171010 {noformat} {noformat} admin@ip-10-178-168-202:~/mesos/build$ lsb_release -a No LSB modules are available. Distributor ID: Debian Description:Debian GNU/Linux 8.10 (jessie) Release:8.10 Codename: jessie {noformat} {noformat} ../configure --enable-libevent --enable-ssl --disable-java --enable-optimize --disable-silent-rules --enable-install-module-dependencies {noformat} The compile error is likely triggered via {{--enable-optimize}} which turns {{O2}} optimizations on. It looks like certain versions of {{gcc}} mistakenly (?) assume that any dereferencing of a type-punned pointer violates strict-aliasing rules. More about strict aliasing and type-punned pointers: https://blog.regehr.org/archives/959 and https://gcc.gnu.org/onlinedocs/gcc-4.6.0/gnat_ugn_unw/Optimization-and-Strict-Aliasing.html More about {{cmsghdr}} structure: http://alas.matf.bg.ac.rs/manuals/lspe/snode=153.html was (Author: alexr): I'm able to reproduce it on Debian 8.10 + gcc 5.5.0-6 with {{O2}} enabled. {noformat} admin@ip-10-178-168-202:~/mesos/build$ gcc --version gcc (Debian 5.5.0-6) 5.5.0 20171010 {noformat} {noformat} admin@ip-10-178-168-202:~/mesos/build$ lsb_release -a No LSB modules are available. Distributor ID: Debian Description:Debian GNU/Linux 8.10 (jessie) Release:8.10 Codename: jessie {noformat} {noformat} ../configure --enable-libevent --enable-ssl --disable-java --enable-optimize --disable-silent-rules --enable-install-module-dependencies {noformat} The compile error is likely triggered via {{--enable-optimize}} which turns {{O2}} optimizations on. 
It looks like certain versions of {{gcc}} mistakenly (?) assume that any dereferencing of a type-punned pointer violates strict-aliasing rules. More about strict aliasing and type-punned pointers: https://blog.regehr.org/archives/959 and https://gcc.gnu.org/onlinedocs/gcc-4.6.0/gnat_ugn_unw/Optimization-and-Strict-Aliasing.html > Error: dereferencing type-punned pointer will break strict-aliasing rules. > -- > > Key: MESOS-6616 > URL: https://issues.apache.org/jira/browse/MESOS-6616 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.0, 1.2.3, 1.3.1, 1.4.1 > Environment: Fedora Rawhide; > Debian 8.10 + gcc 5.5.0-6 with {{O2}} >Reporter: Orion Poplawski >Assignee: Alexander Rukletsov > Labels: compile-error, mesosphere > > Trying to update the mesos package to 1.1.0 in Fedora. Getting: > {noformat} > libtool: compile: g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" > -DPACKAGE_VERSION=\"1.1.0\" "-DPACKAGE_STRING=\"mesos 1.1.0\"" > -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" > -DVERSION=\"1.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 > -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 > -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 > -DLT_OBJDIR=\".libs/\" -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 > -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 > -DHAVE_LIBAPR_1=1 -DHAVE_BOOST_VERSION_HPP=1 -DHAVE_LIBCURL=1 > -DHAVE_ELFIO_ELFIO_HPP=1 -DHAVE_GLOG_LOGGING_H=1 -DHAVE_HTTP_PARSER_H=1 > -DMESOS_HAS_JAVA=1 -DHAVE_LEVELDB_DB_H=1 -DHAVE_LIBNL_3=1 > -DHAVE_LIBNL_ROUTE_3=1 -DHAVE_LIBNL_IDIAG_3=1 -DWITH_NETWORK_ISOLATOR=1 > -DHAVE_GOOGLE_PROTOBUF_MESSAGE_H=1 -DHAVE_EV_H=1 -DHAVE_PICOJSON_H=1 > -DHAVE_LIBSASL2=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 > -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_LIBZ=1 > -DHAVE_ZOOKEEPER_H=1 -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -I. 
-Wall > -Werror -Wsign-compare -DLIBDIR=\"/usr/lib64\" > -DPKGLIBEXECDIR=\"/usr/libexec/mesos\" -DPKGDATADIR=\"/usr/share/mesos\" > -DPKGMODULEDIR=\"/usr/lib64/mesos/modules\" -I../include -I../include > -I../include/mesos -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS > -I../3rdparty/libprocess/include -I../3rdparty/nvml-352.79 > -I../3rdparty/stout/include -DHAS_AUTHENTICATION=1 -Iyes/include > -I/usr/include/subversion-1 -Iyes/include -Iyes/include -Iyes/include/libnl3 > -Iyes/include -I/ -Iyes/include -I/usr/include/apr-1 -I/usr/include/apr-1.0 > -I/builddir/build/BUILD/mesos-1.1.0/libev-4.15/include -isystem yes/include > -Iyes/include -I/usr/src/gmock -I/usr/src/gmock/include -I/usr/src/gmock/src > -I/usr/src/gmock/gtest -I/usr/src/gmock/gtest/include >
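The pattern gcc rejects here is a read through a cast pointer of an unrelated type, e.g. `*(uint32_t*)&f` (in this issue it arises around the {{cmsghdr}} structure linked above). The usual strict-aliasing-safe rewrite is to reinterpret the bytes with {{memcpy}}. A minimal illustration of the idiom, not the actual libprocess code:

```cpp
#include <cstdint>
#include <cstring>

// Reading an object's bytes through a pointer of an unrelated type is what
// -O2's strict-aliasing analysis (enabled by --enable-optimize) rejects.
// std::memcpy performs the same bit-level reinterpretation with defined
// behavior, and compilers optimize it to a plain register move.
uint32_t bitsOf(float f) {
  static_assert(sizeof(uint32_t) == sizeof(float), "sizes must match");
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}
```

Alternatively, `-fno-strict-aliasing` silences the diagnostic at the cost of disabling the optimization, which is why the error only appears with {{--enable-optimize}}.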
[jira] [Updated] (MESOS-8335) ProvisionerDockerTest fails on Debian 9
[ https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armand Grillet updated MESOS-8335: -- Description: Version of Docker used: Docker version 17.11.0-ce, build 1caf76c Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3 Error: {code} [ RUN ] ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' authorizer I1215 00:09:28.697144 30867 master.cpp:456] Master 75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started on 127.0.1.1:35029 I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/4RYdF1/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" --zk_session_timeout="10secs" I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing authenticated frameworks to register I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing authenticated agents to register I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing authenticated HTTP frameworks to register I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for authentication from '/tmp/4RYdF1/credentials' I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' authenticator I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical allocator process I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master! 
I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the registry (0B) in 507904ns I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 28977ns; attempting to update the registry I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the registry in 464896ns I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered registrar I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the registry (167B); allowing 10mins for agents to re-register I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of hierarchical allocator: nothing to recover W1215 00:09:28.706816 19343 process.cpp:2756] Attempted to spawn already running process files@127.0.1.1:35029 I1215 00:09:28.707818 19343 containerizer.cpp:304] Using isolation { environment_secret, volume/sandbox_path, volume/host_path, docker/runtime, filesystem/linux, network/cni, volume/image } I1215 00:09:28.712363 19343 linux_launcher.cpp:146] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher sh: 1: hadoop: not found I1215 00:09:28.740589 19343 fetcher.cpp:69] Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to create HDFS client: Hadoop client is not available, exit status: 32512 I1215 00:09:28.740788 19343 registry_puller.cpp:129] Creating registry puller with docker registry 'https://registry-1.docker.io' I1215 00:09:28.742266 19343 provisioner.cpp:299] Using default backend 'overlay' I1215 00:09:28.745649
[jira] [Updated] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed
[ https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Schlicht updated MESOS-8346: Shepherd: Benjamin Bannier > Resubscription of a resource provider will crash the agent if its HTTP > connection isn't closed > -- > > Key: MESOS-8346 > URL: https://issues.apache.org/jira/browse/MESOS-8346 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Blocker > Labels: mesosphere > > A resource provider might resubscribe while its old HTTP connection wasn't > properly closed. In that case an agent will crash with, e.g., the following > log: > {noformat} > I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource > provider > {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"} > I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider > message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5' > I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total > resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] > F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: > resourceProviders.subscribed.contains(resourceProviderId) > *** Check failure stack trace: *** > E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total > resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] > @0x1125380ef google::LogMessageFatal::~LogMessageFatal() > @0x112534ae9 
google::LogMessageFatal::~LogMessageFatal() > I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation > for 1 agents in 61830ns > I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating > I1219 13:33:51.945955 129146880 master.cpp:1305] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) disconnected > I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated > @0x115f2761d > mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()() > @0x115f2977d > _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_ > @0x115f29740 > _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_ > @0x115f296bb > 
_ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_ > @0x115f2965d > _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_ > @0x115f29631 > _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEvOT_DpOT0_ > @
[jira] [Assigned] (MESOS-8350) Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration
[ https://issues.apache.org/jira/browse/MESOS-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier reassigned MESOS-8350: --- Assignee: Benjamin Bannier > Resource provider-capable agents not correctly synchronizing checkpointed > agent resources on reregistration > --- > > Key: MESOS-8350 > URL: https://issues.apache.org/jira/browse/MESOS-8350 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier >Priority: Critical > > For resource provider-capable agents the master does not re-send checkpointed > resources on agent reregistration; instead the checkpointed resources sent as > part of the {{ReregisterSlaveMessage}} should be used. > This is not what happens in reality. If, e.g., checkpointing of an offer > operation fails and the agent fails over, the checkpointed resources would, as > expected, not be reflected in the agent, but would still be assumed in the > master. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8350) Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration
Benjamin Bannier created MESOS-8350:
Summary: Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration
Key: MESOS-8350
URL: https://issues.apache.org/jira/browse/MESOS-8350
Project: Mesos
Issue Type: Bug
Components: master
Reporter: Benjamin Bannier
Priority: Critical

For resource provider-capable agents the master does not re-send checkpointed resources on agent reregistration; instead, the checkpointed resources sent as part of the {{ReregisterSlaveMessage}} should be used.

This is not what happens in reality. If, e.g., checkpointing of an offer operation fails and the agent fails over, the checkpointed resources would, as expected, not be reflected in the agent, but would still be assumed in the master.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.
[ https://issues.apache.org/jira/browse/MESOS-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298557#comment-16298557 ] Jan Schlicht commented on MESOS-8349: - Discarding a {{Future}} (instead of discarding its {{Promise}}) won't call {{onAny}} callbacks, only an {{onDiscarded}} callback that we haven't set up here. > When a resource provider driver is disconnected, it fails to reconnect. > --- > > Key: MESOS-8349 > URL: https://issues.apache.org/jira/browse/MESOS-8349 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere > > If the resource provider manager closes the HTTP connection of a resource > provider, the resource provider should reconnect itself. For that, the > resource provider driver will change its state to "DISCONNECTED", call a > {{disconnected}} callback and use its endpoint detector to reconnect. > This doesn't work in a testing environment where a > {{ConstantEndpointDetector}} is used. While the resource provider is notified > of the closed HTTP connection (and logs {{End-Of-File received}}), it never > disconnects itself or calls the {{disconnected}} callback. Discarding > {{HttpConnectionProcess::detection}} in > {{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} > callback of that future. This might not be a problem in > {{HttpConnectionProcess}} but could be related to the test case using a > {{ConstantEndpointDetector}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
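The {{Future}}/{{Promise}} distinction in the comment above can be modeled with a toy pair. This is a deliberately simplified model, not the libprocess API: discarding from the future side only records a discard request, while a result callback in the style of {{onAny}} runs only when the promise (producer) side actually abandons the result.

```cpp
#include <functional>
#include <memory>

// Toy model (not libprocess) of the consumer/producer discard asymmetry.
struct State {
  bool discardRequested = false;
  bool done = false;
  std::function<void()> onAny;  // result callback, in the style of onAny
};

struct ToyFuture {
  std::shared_ptr<State> state;
  // Consumer-side discard: only requests a discard; no callback fires.
  void discard() { state->discardRequested = true; }
};

struct ToyPromise {
  std::shared_ptr<State> state;
  // Producer-side discard: the result is abandoned, so onAny now runs.
  void discard() {
    state->done = true;
    if (state->onAny) state->onAny();
  }
};
```

This mirrors why discarding {{HttpConnectionProcess::detection}} (the future) does not trigger the {{onAny}} callback: nothing on the producer side has transitioned the result yet.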
[jira] [Comment Edited] (MESOS-8267) NestedMesosContainerizerTest.ROOT_CGROUPS_RecoverLauncherOrphans is flaky.
[ https://issues.apache.org/jira/browse/MESOS-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297244#comment-16297244 ] Alexander Rukletsov edited comment on MESOS-8267 at 12/20/17 2:03 PM: -- {noformat} Commit: 0cc636b2d5ad5c934b8b7f350bc8c99b9282b5ab [0cc636b] Author: Andrei Budnik Date: 19 December 2017 at 19:45:33 GMT+1 Committer: Alexander Rukletsov Fixed flaky `ROOT_CGROUPS_RecoverLauncherOrphans` test. Containerizer recovery returns control to the caller before completion of destruction of orphaned containers. Previously, `wait` was called on a container right after calling `recover`, so `wait` was almost always successful, because destruction of the orphaned container takes some time to complete. This patch replaces the check for the container's existence with a check that a related freezer cgroup has been destroyed. The freezer cgroup is destroyed during container destruction initiated by a containerizer recovery process. Review: https://reviews.apache.org/r/64680/ {noformat} {noformat} Commit: 196fe20861a4efc8edcabb08ce8821e8ed1b8f02 [196fe20] Author: Andrei Budnik Date: 20 December 2017 at 14:59:50 GMT+1 Committer: Alexander Rukletsov Fixed flaky `NestedMesosContainerizerTest` tests. This patch is an addition to commit 0cc636b2d5. Review: https://reviews.apache.org/r/64749/ {noformat} was (Author: alexr): {noformat} Commit: 0cc636b2d5ad5c934b8b7f350bc8c99b9282b5ab [0cc636b] Author: Andrei Budnik Date: 19 December 2017 at 19:45:33 GMT+1 Committer: Alexander Rukletsov Fixed flaky `ROOT_CGROUPS_RecoverLauncherOrphans` test. Containerizer recovery returns control to the caller before completion of destruction of orphaned containers. Previously, `wait` was called on a container right after calling `recover`, so `wait` was almost always successful, because destruction of the orphaned container takes some time to complete. 
This patch replaces check for the container existence with the check that a related freezer cgroup has been destroyed. The freezer cgroup is destroyed during container destruction initiated by a containerizer recovery process. Review: https://reviews.apache.org/r/64680/ {noformat} > NestedMesosContainerizerTest.ROOT_CGROUPS_RecoverLauncherOrphans is flaky. > -- > > Key: MESOS-8267 > URL: https://issues.apache.org/jira/browse/MESOS-8267 > Project: Mesos > Issue Type: Bug > Components: test > Environment: Ubuntu 16.04 >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik > Labels: flaky-test > Fix For: 1.5.0 > > Attachments: ROOT_CGROUPS_RecoverLauncherOrphans-badrun.txt > > > {noformat} > ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1902 > wait.get() is NONE > {noformat} > Full log attached. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
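The fix in the commit above waits for the freezer cgroup to disappear instead of asserting the container is gone immediately after `recover`. The actual test is C++; the idea reduces to a poll-until-gone pattern, sketched here in plain Python (the path, timeout, and function name are illustrative, not Mesos code):

```python
import os
import time

def wait_for_cgroup_destroyed(cgroup_path, timeout=10.0, interval=0.1):
    """Poll until `cgroup_path` no longer exists, or until `timeout`.

    Returns True if the directory vanished (destruction completed),
    False if it is still present when the deadline expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not os.path.exists(cgroup_path):
            return True   # cgroup removed: destruction finished
        time.sleep(interval)
    return os.path.exists(cgroup_path) is False

# Hypothetical usage, path invented for illustration:
# wait_for_cgroup_destroyed("/sys/fs/cgroup/freezer/mesos/<container-id>")
```

This avoids the race in the original test: destruction is asynchronous, so the correct assertion is "eventually gone", not "gone right now".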
[jira] [Created] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.
Jan Schlicht created MESOS-8349: --- Summary: When a resource provider driver is disconnected, it fails to reconnect. Key: MESOS-8349 URL: https://issues.apache.org/jira/browse/MESOS-8349 Project: Mesos Issue Type: Bug Affects Versions: 1.5.0 Reporter: Jan Schlicht Assignee: Jan Schlicht If the resource provider manager closes the HTTP connection of a resource provider, the resource provider should reconnect itself. For that, the resource provider driver will change its state to "DISCONNECTED", call a {{disconnected}} callback and use its endpoint detector to reconnect. This doesn't work in a testing environment where a {{ConstantEndpointDetector}} is used. While the resource provider is notified of the closed HTTP connection (and logs {{End-Of-File received}}), it never disconnects itself or calls the {{disconnected}} callback. Discarding {{HttpConnectionProcess::detection}} in {{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} callback of that future. This might not be a problem in {{HttpConnectionProcess}} but could be related to the test case using a {{ConstantEndpointDetector}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
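The discard semantics behind this bug (per the comment on this ticket: discarding a {{Future}}, instead of discarding its {{Promise}}, won't call {{onAny}} callbacks, only an {{onDiscarded}} callback) can be sketched with a toy model. This is plain Python, not libprocess; the classes and callback names are illustrative only:

```python
# Toy Future/Promise model illustrating consumer-side vs producer-side
# discard. NOT libprocess -- just the semantics described in the ticket.

class Future:
    def __init__(self):
        self.state = "PENDING"
        self._on_any = []       # run when the future actually transitions
        self._on_discard = []   # run when a discard is merely *requested*

    def on_any(self, fn):
        self._on_any.append(fn)

    def on_discard(self, fn):
        self._on_discard.append(fn)

    def discard(self):
        # Consumer-side discard: only a request. The future does not
        # transition, so onAny callbacks are NOT invoked.
        for fn in self._on_discard:
            fn()

class Promise:
    def __init__(self):
        self.future = Future()

    def discard(self):
        # Producer-side discard: the future transitions to DISCARDED,
        # and onAny callbacks fire.
        self.future.state = "DISCARDED"
        for fn in self.future._on_any:
            fn(self.future)

fired = []
p = Promise()
p.future.on_any(lambda f: fired.append("onAny"))

p.future.discard()   # consumer discard: no onAny handler runs
assert fired == []

p.discard()          # producer discard: onAny runs
assert fired == ["onAny"]
```

In this model, code that discards only the future and waits for {{onAny}} (as the driver does with {{detection}}) waits forever, which matches the observed behavior.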
[jira] [Updated] (MESOS-8335) ProvisionerDockerTest fails on Debian 9
[ https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armand Grillet updated MESOS-8335: -- Description: Version of Docker used: Docker version 17.11.0-ce, build 1caf76c Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3 Error: {code} [ RUN ] ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2 I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' authorizer I1215 00:09:28.697144 30867 master.cpp:456] Master 75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started on 127.0.1.1:35029 I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/4RYdF1/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" --zk_session_timeout="10secs" I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing authenticated frameworks to register I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing authenticated agents to register I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing authenticated HTTP frameworks to register I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for authentication from '/tmp/4RYdF1/credentials' I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' authenticator I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical allocator process I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master! 
I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the registry (0B) in 507904ns I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 28977ns; attempting to update the registry I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the registry in 464896ns I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered registrar I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the registry (167B); allowing 10mins for agents to re-register I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of hierarchical allocator: nothing to recover W1215 00:09:28.706816 19343 process.cpp:2756] Attempted to spawn already running process files@127.0.1.1:35029 I1215 00:09:28.707818 19343 containerizer.cpp:304] Using isolation { environment_secret, volume/sandbox_path, volume/host_path, docker/runtime, filesystem/linux, network/cni, volume/image } I1215 00:09:28.712363 19343 linux_launcher.cpp:146] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher sh: 1: hadoop: not found I1215 00:09:28.740589 19343 fetcher.cpp:69] Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to create HDFS client: Hadoop client is not available, exit status: 32512 I1215 00:09:28.740788 19343 registry_puller.cpp:129] Creating registry puller with docker registry 'https://registry-1.docker.io' I1215 00:09:28.742266 19343 provisioner.cpp:299] Using default backend 'overlay' I1215 00:09:28.745649
[jira] [Commented] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed
[ https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298138#comment-16298138 ] Jan Schlicht commented on MESOS-8346: - It will land today; the patch looks good and just needs a small update. > Resubscription of a resource provider will crash the agent if its HTTP > connection isn't closed > -- > > Key: MESOS-8346 > URL: https://issues.apache.org/jira/browse/MESOS-8346 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Blocker > Labels: mesosphere > > A resource provider might resubscribe while its old HTTP connection wasn't > properly closed. In that case an agent will crash with, e.g., the following > log: > {noformat} > I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource > provider > {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"} > I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider > message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5' > I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total > resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] > F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: > resourceProviders.subscribed.contains(resourceProviderId) > *** Check failure stack trace: *** > E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total > resources cpus:2; mem:1024; disk:1024; 
ports:[31000-32000] > @0x1125380ef google::LogMessageFatal::~LogMessageFatal() > @0x112534ae9 google::LogMessageFatal::~LogMessageFatal() > I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation > for 1 agents in 61830ns > I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating > I1219 13:33:51.945955 129146880 master.cpp:1305] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) disconnected > I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated > @0x115f2761d > mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()() > @0x115f2977d > _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_ > @0x115f29740 > _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_ > @0x115f296bb > 
_ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_ > @0x115f2965d > _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_ > @0x115f29631 >
[jira] [Commented] (MESOS-6623) Re-enable tests impacted by request streaming support
[ https://issues.apache.org/jira/browse/MESOS-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298114#comment-16298114 ] Gilbert Song commented on MESOS-6623: - Seems like this is targeted for 1.5.0. Do we have an estimate of when it will land? /cc [~anandmazumdar] > Re-enable tests impacted by request streaming support > - > > Key: MESOS-6623 > URL: https://issues.apache.org/jira/browse/MESOS-6623 > Project: Mesos > Issue Type: Bug > Components: HTTP API, test >Reporter: Anand Mazumdar >Assignee: Anand Mazumdar >Priority: Critical > Labels: mesosphere > > We added support for HTTP request streaming in libprocess as part of > MESOS-6466. However, this broke a few tests that relied on HTTP request > filtering since the handlers no longer have access to the body of the request > when {{visit()}} is invoked. We would need to revisit how we do HTTP request > filtering and then re-enable these tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
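The breakage described in this ticket can be shown in miniature. This is plain Python, not libprocess; the request classes are invented for illustration. The point is that a dispatch-time filter that inspects the request body works when the body is buffered up front, but sees nothing once the body arrives as an unread stream:

```python
# Sketch: why request streaming breaks body-based filtering at
# dispatch time. Class names are illustrative, not libprocess APIs.

class BufferedRequest:
    """Pre-streaming model: the full body is available at dispatch."""
    def __init__(self, body):
        self.body = body

class StreamingRequest:
    """Post-streaming model: body arrives later via a reader."""
    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self.body = None   # nothing buffered when the handler is chosen

    def read_all(self):
        # Only after the handler reads the stream does the body exist.
        return "".join(self._chunks)

def visit_filter(request, needle):
    # A dispatch-time filter can only inspect what is already buffered.
    return request.body is not None and needle in request.body

buffered = BufferedRequest('{"type": "SUBSCRIBE"}')
streaming = StreamingRequest(['{"type": "', 'SUBSCRIBE"}'])

assert visit_filter(buffered, "SUBSCRIBE")        # worked pre-streaming
assert not visit_filter(streaming, "SUBSCRIBE")   # filter is now blind
assert "SUBSCRIBE" in streaming.read_all()        # body only available later
```

Re-enabling the affected tests therefore means moving the filtering decision to a point where the body has been read, which is the "revisit how we do HTTP request filtering" work the ticket calls for.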
[jira] [Commented] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed
[ https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298109#comment-16298109 ] Gilbert Song commented on MESOS-8346: - [~nfnt], seems like this is targeted for 1.5.0. Do we have an estimate of when it will land? > Resubscription of a resource provider will crash the agent if its HTTP > connection isn't closed > -- > > Key: MESOS-8346 > URL: https://issues.apache.org/jira/browse/MESOS-8346 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.5.0 >Reporter: Jan Schlicht >Assignee: Jan Schlicht >Priority: Blocker > Labels: mesosphere > > A resource provider might resubscribe while its old HTTP connection wasn't > properly closed. In that case an agent will crash with, e.g., the following > log: > {noformat} > I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource > provider > {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"} > I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource > provider 8e71beef-796e-4bde-9257-952ed0f230a5 > E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider > message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5' > I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total > resources cpus:2; mem:1024; disk:1024; ports:[31000-32000] > F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: > resourceProviders.subscribed.contains(resourceProviderId) > *** Check failure stack trace: *** > E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received > I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total > resources cpus:2; mem:1024; 
disk:1024; ports:[31000-32000] > @0x1125380ef google::LogMessageFatal::~LogMessageFatal() > @0x112534ae9 google::LogMessageFatal::~LogMessageFatal() > I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation > for 1 agents in 61830ns > I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating > I1219 13:33:51.945955 129146880 master.cpp:1305] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) disconnected > I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 > (172.18.8.13) > I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent > 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated > @0x115f2761d > mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()() > @0x115f2977d > _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_ > @0x115f29740 > _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_ > @0x115f296bb > 
_ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_ > @0x115f2965d > _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_ > @0x115f29631 >
[jira] [Commented] (MESOS-8297) Built-in driver-based executors ignore kill task if the task has not been launched.
[ https://issues.apache.org/jira/browse/MESOS-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298108#comment-16298108 ] Gilbert Song commented on MESOS-8297: - Seems like this is targeted for 1.5.0. Do we have an estimate of when it will land? /cc [~alexr] > Built-in driver-based executors ignore kill task if the task has not been > launched. > --- > > Key: MESOS-8297 > URL: https://issues.apache.org/jira/browse/MESOS-8297 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Blocker > Labels: mesosphere > > If the Docker executor receives a kill task request and the task has never been > launched, the request is ignored. We now know why: the executor never > received the registration confirmation, hence ignored the launch task > request, hence the task never started. This is how the executor > enters an idle state, waiting for registration and ignoring kill task > requests. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
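The idle-state trap described in this ticket can be sketched as a toy state machine. This is plain Python with invented names, not the actual executor code; it only reproduces the reported sequence: registration never arrives, so the launch is dropped, so the later kill finds no task and is dropped too:

```python
# Toy executor state machine reproducing the MESOS-8297 behavior.
# All names are illustrative; the real executors are C++.

class Executor:
    def __init__(self):
        self.registered = False
        self.task = None
        self.events = []

    def on_registered(self):
        self.registered = True

    def launch(self, task_id):
        if not self.registered:
            # Launch before registration is ignored.
            self.events.append(("ignored-launch", task_id))
            return
        self.task = task_id

    def kill(self, task_id):
        if self.task != task_id:
            # Task was never launched: the kill is silently ignored
            # and the executor keeps idling, waiting for registration.
            self.events.append(("ignored-kill", task_id))
            return
        self.task = None
        self.events.append(("killed", task_id))

# Buggy sequence: registration confirmation never arrives.
e = Executor()
e.launch("t1")   # dropped: not registered
e.kill("t1")     # dropped: task unknown -> executor idles forever
assert e.events == [("ignored-launch", "t1"), ("ignored-kill", "t1")]
```

A plausible direction for a fix (an assumption, not what the patch necessarily does) is to remember a kill for a not-yet-launched task and act on it once registration or the launch arrives, instead of dropping it.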