[jira] [Created] (MESOS-8353) Duplicate task for same framework on multiple agents crashes out master after failover

2017-12-20 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-8353:


 Summary: Duplicate task for same framework on multiple agents 
crashes out master after failover
 Key: MESOS-8353
 URL: https://issues.apache.org/jira/browse/MESOS-8353
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li


We have seen a Mesos master crash loop after a leader failover. After more 
investigation, it appears that the same task ID managed to get created on 
multiple Mesos agents in the cluster.

One possible sequence of events that can lead to this problem:

1. Task T1 was launched through master M1 onto agent A1 for framework F;
2. Master M1 failed over to M2;
3. Before A1 reregistered with M2, the same T1 was launched onto agent A2: M2 
did not know about the previous T1 yet, so it accepted the launch and sent it 
to A2;
4. A1 reregistered: this probably crashed M2 (because the same task cannot be 
added twice);
5. When M3 tried to come up after M2, it crashed as well because both A1 and A2 
tried to add T1 to the framework.

(I only have logs proving the last step right now.)

This happened on 1.4.0 masters.

Although this is probably triggered by incorrect retry logic on the framework 
side, I wonder whether the Mesos master should add extra protection to prevent 
such an issue from happening. One possible idea is to instruct one of the 
agents carrying a task with a duplicate ID to terminate the corresponding task, 
or to simply refuse to reregister such agents and instruct them to shut down.
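
A rough sketch of the kind of duplicate-task check the master could perform 
when an agent reregisters. The types and bookkeeping below are hypothetical 
simplifications for illustration, not the master's actual data structures:
{code}
#include <iostream>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
using TaskID = std::string;
using FrameworkID = std::string;
using AgentID = std::string;

// For each framework, remember which agent currently runs each task ID.
std::map<FrameworkID, std::map<TaskID, AgentID>> knownTasks;

// Returns the task IDs reported by `agent` that collide with tasks already
// tracked on a *different* agent for the same framework. Instead of hitting
// a duplicate-add check failure, the master could ask `agent` to terminate
// these tasks or refuse the reregistration altogether.
std::set<TaskID> duplicateTasks(
    const FrameworkID& framework,
    const AgentID& agent,
    const std::set<TaskID>& reportedTasks)
{
  std::set<TaskID> duplicates;
  const std::map<TaskID, AgentID>& tasks = knownTasks[framework];
  for (const TaskID& task : reportedTasks) {
    auto it = tasks.find(task);
    if (it != tasks.end() && it->second != agent) {
      duplicates.insert(task);
    }
  }
  return duplicates;
}

int main()
{
  knownTasks["F"]["T1"] = "A2";  // T1 was already accepted on agent A2.

  // Agent A1 reregisters and reports that it also runs T1.
  for (const TaskID& task : duplicateTasks("F", "A1", {"T1"})) {
    std::cout << "Duplicate task " << task << " reported by A1" << std::endl;
  }
  return 0;
}
{code}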



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8352) Resources may get over-allocated to some roles while failing to meet the quota of other roles.

2017-12-20 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-8352:
---

 Summary: Resources may get over-allocated to some roles while failing 
to meet the quota of other roles.
 Key: MESOS-8352
 URL: https://issues.apache.org/jira/browse/MESOS-8352
 Project: Mesos
  Issue Type: Bug
  Components: allocation
Reporter: Meng Zhu
Assignee: Meng Zhu


In the quota role allocation stage, if a role gets some resources on an agent 
to meet its quota, it will also get all other resources on the same agent that 
it does not have quota for. This may starve roles behind it that have quotas 
set for those resources.

To fix that, we need to track quota headroom in the quota role allocation 
stage. In that stage, if a role has no quota set for a scalar resource, it will 
get that resource only when both of the following conditions are met (see the 
sketch below):

- It got some other resources on the same agent to meet its quota; and

- After allocating those resources, the quota headroom is still above the 
required amount.
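
A minimal sketch of the intended check, using hypothetical scalar quantities 
(a plain map from resource name to amount) instead of the allocator's real 
{{Resources}} type:
{code}
#include <iostream>
#include <map>
#include <string>

// Hypothetical, simplified scalar quantities keyed by resource name
// (e.g. "cpus", "mem", "gpus").
using Quantities = std::map<std::string, double>;

// Decide whether a scalar resource the role has no quota for may be handed
// out during the quota role allocation stage.
bool allocateNonQuotaResource(
    bool gotQuotaResourcesOnAgent,      // condition 1
    const Quantities& headroom,         // resources still available for quota
    const Quantities& requiredHeadroom, // unsatisfied quota of other roles
    const Quantities& allocation)       // what we are about to hand out
{
  if (!gotQuotaResourcesOnAgent) {
    return false;
  }

  // Condition 2: after this allocation, the remaining headroom must still
  // cover the quota that other roles have yet to receive.
  for (const auto& [name, required] : requiredHeadroom) {
    double available = 0.0;
    if (auto it = headroom.find(name); it != headroom.end()) {
      available = it->second;
    }

    double allocated = 0.0;
    if (auto it = allocation.find(name); it != allocation.end()) {
      allocated = it->second;
    }

    if (available - allocated < required) {
      return false;
    }
  }

  return true;
}

int main()
{
  Quantities headroom = {{"gpus", 2}};
  Quantities required = {{"gpus", 2}};   // another role still needs 2 gpus
  Quantities allocation = {{"gpus", 1}}; // role without gpu quota wants 1

  std::cout << std::boolalpha
            << allocateNonQuotaResource(true, headroom, required, allocation)
            << std::endl;  // false: would eat into the required headroom
  return 0;
}
{code}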



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8224) mesos.interface 1.4.0 cannot be installed with pip

2017-12-20 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16299469#comment-16299469
 ] 

Benjamin Mahler commented on MESOS-8224:


[~karya] can you take a look at the 1.4.0 egg issue?

> mesos.interface 1.4.0 cannot be installed with pip
> --
>
> Key: MESOS-8224
> URL: https://issues.apache.org/jira/browse/MESOS-8224
> Project: Mesos
>  Issue Type: Task
>  Components: release
>Reporter: Bill Farner
>
> This breaks some framework development tooling.
> With latest pip:
> {noformat}
> $ python -m pip -V
> pip 9.0.1 from 
> /Users/wfarner/code/aurora/build-support/python/pycharm.venv/lib/python2.7/site-packages
>  (python 2.7)
> {noformat}
> This works fine for previous releases:
> {noformat}
> $ python -m pip install mesos.interface==1.3.0
> Collecting mesos.interface==1.3.0
> ...
> Installing collected packages: mesos.interface
> Successfully installed mesos.interface-1.3.0
> {noformat}
> But it does not for 1.4.0:
> {noformat}
> $ python -m pip install mesos.interface==1.4.0
> Collecting mesos.interface==1.4.0
>   Could not find a version that satisfies the requirement 
> mesos.interface==1.4.0 (from versions: 0.21.2.linux-x86_64, 
> 0.22.1.2.linux-x86_64, 0.22.2.linux-x86_64, 0.23.1.linux-x86_64, 
> 0.24.1.linux-x86_64, 0.24.2.linux-x86_64, 0.25.0.linux-x86_64, 
> 0.25.1.linux-x86_64, 0.26.1.linux-x86_64, 0.27.0.linux-x86_64, 
> 0.27.1.linux-x86_64, 0.27.2.linux-x86_64, 0.28.0.linux-x86_64, 
> 0.28.1.linux-x86_64, 0.28.2.linux-x86_64, 1.0.0.linux-x86_64, 
> 1.0.1.linux-x86_64, 1.1.0.linux-x86_64, 1.2.0.linux-x86_64, 
> 1.3.0.linux-x86_64, 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1.2, 
> 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.24.2, 0.25.0, 0.25.1, 0.26.0, 
> 0.26.1, 0.27.0, 0.27.1, 0.27.2, 0.28.0, 0.28.1, 0.28.2, 1.0.0, 1.0.1, 1.1.0, 
> 1.2.0, 1.3.0)
> No matching distribution found for mesos.interface==1.4.0
> {noformat}
> Verbose output shows that pip skips the 1.4.0 distribution:
> {noformat}
> $ python -m pip install -v mesos.interface==1.4.0 | grep 1.4.0
> Collecting mesos.interface==1.4.0
> Skipping link 
> https://pypi.python.org/packages/ef/1b/d5b0c1456f755ad42477eaa9667e22d1f5fd8e2fce0f9b26937f93743f6c/mesos.interface-1.4.0-py2.7.egg#md5=32113860961d49c31f69f7b13a9bc063
>  (from https://pypi.python.org/simple/mesos-interface/); unsupported archive 
> format: .egg
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8351) Improve mesos-style.py linters instantiation

2017-12-20 Thread Armand Grillet (JIRA)
Armand Grillet created MESOS-8351:
-

 Summary: Improve mesos-style.py linters instantiation
 Key: MESOS-8351
 URL: https://issues.apache.org/jira/browse/MESOS-8351
 Project: Mesos
  Issue Type: Improvement
Reporter: Armand Grillet
Assignee: Armand Grillet


Currently, all linters are instantiated when running mesos-style.py without 
first checking whether they will be needed. In the main function of 
mesos-style.py, a check should be added to determine which linters to 
instantiate and use, depending on the input given (the list of files that 
should be linted).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8350) Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration

2017-12-20 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8350:

Description: 
For resource provider-capable agents the master does not re-send checkpointed 
resources on agent reregistration; instead the checkpointed resources sent as 
part of the {{ReregisterSlaveMessage}} should be used.

This is not what happens in reality. If, e.g., checkpointing of an offer 
operation fails and the agent fails over, the checkpointed resources would, as 
expected, not be reflected in the agent, but would still be assumed in the 
master.

A workaround is to fail over the master which would lead to the newly elected 
master bootstrapping agent state from {{ReregisterSlaveMessage}}.

  was:
For resource provider-capable agents the master does not re-send checkpointed 
resources on agent reregistration; instead the checkpointed resources sent as 
part of the {{ReregisterSlaveMessage}} should be used.

This is not what happens in reality. If, e.g., checkpointing of an offer 
operation fails and the agent fails over, the checkpointed resources would, as 
expected, not be reflected in the agent, but would still be assumed in the 
master.


> Resource provider-capable agents not correctly synchronizing checkpointed 
> agent resources on reregistration
> ---
>
> Key: MESOS-8350
> URL: https://issues.apache.org/jira/browse/MESOS-8350
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> For resource provider-capable agents the master does not re-send checkpointed 
> resources on agent reregistration; instead the checkpointed resources sent as 
> part of the {{ReregisterSlaveMessage}} should be used.
> This is not what happens in reality. If, e.g., checkpointing of an offer 
> operation fails and the agent fails over, the checkpointed resources would, as 
> expected, not be reflected in the agent, but would still be assumed in the 
> master.
> A workaround is to fail over the master which would lead to the newly elected 
> master bootstrapping agent state from {{ReregisterSlaveMessage}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-6616) Error: dereferencing type-punned pointer will break strict-aliasing rules.

2017-12-20 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295472#comment-16295472
 ] 

Alexander Rukletsov edited comment on MESOS-6616 at 12/20/17 7:32 PM:
--

I'm able to reproduce it on Debian 8.10 + gcc 5.5.0-6 with {{O2}} enabled.
{noformat}
admin@ip-10-178-168-202:~/mesos/build$ gcc --version
gcc (Debian 5.5.0-6) 5.5.0 20171010
{noformat}
{noformat}
admin@ip-10-178-168-202:~/mesos/build$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:Debian GNU/Linux 8.10 (jessie)
Release:8.10
Codename:   jessie
{noformat}
{noformat}
../configure --enable-libevent --enable-ssl --disable-java --enable-optimize 
--disable-silent-rules --enable-install-module-dependencies
{noformat}
The compile error is likely triggered via {{--enable-optimize}} which turns 
{{O2}} optimizations on.

It looks like certain versions of {{gcc}} mistakenly (?) assume that any 
dereferencing of a type-punned pointer violates strict-aliasing rules.

More about strict aliasing and type-punned pointers: 
https://blog.regehr.org/archives/959 and 
https://gcc.gnu.org/onlinedocs/gcc-4.6.0/gnat_ugn_unw/Optimization-and-Strict-Aliasing.html

More about {{cmsghdr}} structure:
http://alas.matf.bg.ac.rs/manuals/lspe/snode=153.html
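
For reference, a minimal standalone example (unrelated to the Mesos code base) 
that typically triggers this warning when compiled with {{g++ -O2 -Wall}}; the 
exact behavior varies between GCC versions:
{code}
#include <cstdio>
#include <cstring>

int main()
{
  float f = 1.0f;

  // Type-punned dereference: reading a float's bytes through an unsigned int*.
  // With -O2 (which enables -fstrict-aliasing) GCC typically warns:
  //   dereferencing type-punned pointer will break strict-aliasing rules
  unsigned int punned = *reinterpret_cast<unsigned int*>(&f);
  std::printf("%x\n", punned);

  // Aliasing-safe alternative: copy the bytes instead of punning the pointer.
  unsigned int copied;
  std::memcpy(&copied, &f, sizeof(copied));
  std::printf("%x\n", copied);

  return 0;
}
{code}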


was (Author: alexr):
I'm able to reproduce it on Debian 8.10 + gcc 5.5.0-6 with {{O2}} enabled.
{noformat}
admin@ip-10-178-168-202:~/mesos/build$ gcc --version
gcc (Debian 5.5.0-6) 5.5.0 20171010
{noformat}
{noformat}
admin@ip-10-178-168-202:~/mesos/build$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:Debian GNU/Linux 8.10 (jessie)
Release:8.10
Codename:   jessie
{noformat}
{noformat}
../configure --enable-libevent --enable-ssl --disable-java --enable-optimize 
--disable-silent-rules --enable-install-module-dependencies
{noformat}
The compile error is likely triggered via {{--enable-optimize}} which turns 
{{O2}} optimizations on.

It looks like certain versions of {{gcc}} mistakenly (?) assume that any 
dereferencing of a type-punned pointer violates strict-aliasing rules.

More about strict aliasing and type-punned pointers: 
https://blog.regehr.org/archives/959 and 
https://gcc.gnu.org/onlinedocs/gcc-4.6.0/gnat_ugn_unw/Optimization-and-Strict-Aliasing.html

> Error: dereferencing type-punned pointer will break strict-aliasing rules.
> --
>
> Key: MESOS-6616
> URL: https://issues.apache.org/jira/browse/MESOS-6616
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.2.3, 1.3.1, 1.4.1
> Environment: Fedora Rawhide;
> Debian 8.10 + gcc 5.5.0-6 with {{O2}}
>Reporter: Orion Poplawski
>Assignee: Alexander Rukletsov
>  Labels: compile-error, mesosphere
>
> Trying to update the mesos package to 1.1.0 in Fedora.  Getting:
> {noformat}
> libtool: compile:  g++ -DPACKAGE_NAME=\"mesos\" -DPACKAGE_TARNAME=\"mesos\" 
> -DPACKAGE_VERSION=\"1.1.0\" "-DPACKAGE_STRING=\"mesos 1.1.0\"" 
> -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"mesos\" 
> -DVERSION=\"1.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 
> -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 
> -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 
> -DLT_OBJDIR=\".libs/\" -DHAVE_CXX11=1 -DHAVE_PTHREAD_PRIO_INHERIT=1 
> -DHAVE_PTHREAD=1 -DHAVE_LIBZ=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 
> -DHAVE_LIBAPR_1=1 -DHAVE_BOOST_VERSION_HPP=1 -DHAVE_LIBCURL=1 
> -DHAVE_ELFIO_ELFIO_HPP=1 -DHAVE_GLOG_LOGGING_H=1 -DHAVE_HTTP_PARSER_H=1 
> -DMESOS_HAS_JAVA=1 -DHAVE_LEVELDB_DB_H=1 -DHAVE_LIBNL_3=1 
> -DHAVE_LIBNL_ROUTE_3=1 -DHAVE_LIBNL_IDIAG_3=1 -DWITH_NETWORK_ISOLATOR=1 
> -DHAVE_GOOGLE_PROTOBUF_MESSAGE_H=1 -DHAVE_EV_H=1 -DHAVE_PICOJSON_H=1 
> -DHAVE_LIBSASL2=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 
> -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_LIBZ=1 
> -DHAVE_ZOOKEEPER_H=1 -DHAVE_PYTHON=\"2.7\" -DMESOS_HAS_PYTHON=1 -I. -Wall 
> -Werror -Wsign-compare -DLIBDIR=\"/usr/lib64\" 
> -DPKGLIBEXECDIR=\"/usr/libexec/mesos\" -DPKGDATADIR=\"/usr/share/mesos\" 
> -DPKGMODULEDIR=\"/usr/lib64/mesos/modules\" -I../include -I../include 
> -I../include/mesos -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS 
> -I../3rdparty/libprocess/include -I../3rdparty/nvml-352.79 
> -I../3rdparty/stout/include -DHAS_AUTHENTICATION=1 -Iyes/include 
> -I/usr/include/subversion-1 -Iyes/include -Iyes/include -Iyes/include/libnl3 
> -Iyes/include -I/ -Iyes/include -I/usr/include/apr-1 -I/usr/include/apr-1.0 
> -I/builddir/build/BUILD/mesos-1.1.0/libev-4.15/include -isystem yes/include 
> -Iyes/include -I/usr/src/gmock -I/usr/src/gmock/include -I/usr/src/gmock/src 
> -I/usr/src/gmock/gtest -I/usr/src/gmock/gtest/include 
> 

[jira] [Updated] (MESOS-8335) ProvisionerDockerTest fails on Debian 9

2017-12-20 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8335:
--
Description: 
Version of Docker used: Docker version 17.11.0-ce, build 1caf76c
Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 
OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) 
libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3

Error:
{code}
[ RUN  ] 
ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' authorizer
I1215 00:09:28.697144 30867 master.cpp:456] Master 
75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started on 
127.0.1.1:35029
I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/4RYdF1/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" 
--zk_session_timeout="10secs"
I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing authenticated 
frameworks to register
I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing authenticated 
agents to register
I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing authenticated 
HTTP frameworks to register
I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/4RYdF1/credentials'
I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' 
authenticator
I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled
I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given
I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master!
I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar
I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar
I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the 
registry (0B) in 507904ns
I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 28977ns; 
attempting to update the registry
I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the 
registry in 464896ns
I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered registrar
I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the 
registry (167B); allowing 10mins for agents to re-register
I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W1215 00:09:28.706816 19343 process.cpp:2756] Attempted to spawn already 
running process files@127.0.1.1:35029
I1215 00:09:28.707818 19343 containerizer.cpp:304] Using isolation { 
environment_secret, volume/sandbox_path, volume/host_path, docker/runtime, 
filesystem/linux, network/cni, volume/image }
I1215 00:09:28.712363 19343 linux_launcher.cpp:146] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
sh: 1: hadoop: not found
I1215 00:09:28.740589 19343 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Hadoop 
client is not available, exit status: 32512
I1215 00:09:28.740788 19343 registry_puller.cpp:129] Creating registry puller 
with docker registry 'https://registry-1.docker.io'
I1215 00:09:28.742266 19343 provisioner.cpp:299] Using default backend 'overlay'
I1215 00:09:28.745649 

[jira] [Updated] (MESOS-8335) ProvisionerDockerTest fails on Debian 9

2017-12-20 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8335:
--
Description: 
Version of Docker used: Docker version 17.11.0-ce, build 1caf76c
Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 
OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) 
libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3

Error:
{code}
[ RUN  ] 
ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' authorizer
I1215 00:09:28.697144 30867 master.cpp:456] Master 
75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started on 
127.0.1.1:35029
I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/4RYdF1/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" 
--zk_session_timeout="10secs"
I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing authenticated 
frameworks to register
I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing authenticated 
agents to register
I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing authenticated 
HTTP frameworks to register
I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/4RYdF1/credentials'
I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' 
authenticator
I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled
I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given
I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master!
I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar
I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar
I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the 
registry (0B) in 507904ns
I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 28977ns; 
attempting to update the registry
I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the 
registry in 464896ns
I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered registrar
I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the 
registry (167B); allowing 10mins for agents to re-register
I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W1215 00:09:28.706816 19343 process.cpp:2756] Attempted to spawn already 
running process files@127.0.1.1:35029
I1215 00:09:28.707818 19343 containerizer.cpp:304] Using isolation { 
environment_secret, volume/sandbox_path, volume/host_path, docker/runtime, 
filesystem/linux, network/cni, volume/image }
I1215 00:09:28.712363 19343 linux_launcher.cpp:146] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
sh: 1: hadoop: not found
I1215 00:09:28.740589 19343 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Hadoop 
client is not available, exit status: 32512
I1215 00:09:28.740788 19343 registry_puller.cpp:129] Creating registry puller 
with docker registry 'https://registry-1.docker.io'
I1215 00:09:28.742266 19343 provisioner.cpp:299] Using default backend 'overlay'
I1215 00:09:28.745649 

[jira] [Updated] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed

2017-12-20 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-8346:

Shepherd: Benjamin Bannier

> Resubscription of a resource provider will crash the agent if its HTTP 
> connection isn't closed
> --
>
> Key: MESOS-8346
> URL: https://issues.apache.org/jira/browse/MESOS-8346
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Blocker
>  Labels: mesosphere
>
> A resource provider might resubscribe while its old HTTP connection wasn't 
> properly closed. In that case an agent will crash with, e.g., the following 
> log:
> {noformat}
> I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource 
> provider 
> {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
> I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider 
> message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
> I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: 
> resourceProviders.subscribed.contains(resourceProviderId) 
> *** Check failure stack trace: ***
> E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> @0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
> @0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
> I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 61830ns
> I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
> I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13) disconnected
> I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
> @0x115f2761d  
> mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
> @0x115f2977d  
> _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
> @0x115f29740  
> _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_
> @0x115f296bb  
> _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_
> @0x115f2965d  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
> @0x115f29631  
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS6_14HttpConnectionERKNS5_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEvOT_DpOT0_
> @

[jira] [Assigned] (MESOS-8350) Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration

2017-12-20 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-8350:
---

Assignee: Benjamin Bannier

> Resource provider-capable agents not correctly synchronizing checkpointed 
> agent resources on reregistration
> ---
>
> Key: MESOS-8350
> URL: https://issues.apache.org/jira/browse/MESOS-8350
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> For resource provider-capable agents the master does not re-send checkpointed 
> resources on agent reregistration; instead the checkpointed resources sent as 
> part of the {{ReregisterSlaveMessage}} should be used.
> This is not what happens in reality. If, e.g., checkpointing of an offer 
> operation fails and the agent fails over, the checkpointed resources would, as 
> expected, not be reflected in the agent, but would still be assumed in the 
> master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8350) Resource provider-capable agents not correctly synchronizing checkpointed agent resources on reregistration

2017-12-20 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-8350:
---

 Summary: Resource provider-capable agents not correctly 
synchronizing checkpointed agent resources on reregistration
 Key: MESOS-8350
 URL: https://issues.apache.org/jira/browse/MESOS-8350
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Benjamin Bannier
Priority: Critical


For resource provider-capable agents the master does not re-send checkpointed 
resources on agent reregistration; instead the checkpointed resources sent as 
part of the {{ReregisterSlaveMessage}} should be used.

This is not what happens in reality. If, e.g., checkpointing of an offer 
operation fails and the agent fails over, the checkpointed resources would, as 
expected, not be reflected in the agent, but would still be assumed in the 
master.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.

2017-12-20 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298557#comment-16298557
 ] 

Jan Schlicht commented on MESOS-8349:
-

Discarding a {{Future}} (instead of discarding its {{Promise}}) won't call 
{{onAny}} callbacks, only an {{onDiscarded}} callback, which we haven't set up 
here.

> When a resource provider driver is disconnected, it fails to reconnect.
> ---
>
> Key: MESOS-8349
> URL: https://issues.apache.org/jira/browse/MESOS-8349
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>  Labels: mesosphere
>
> If the resource provider manager closes the HTTP connection of a resource 
> provider, the resource provider should reconnect itself. For that, the 
> resource provider driver will change its state to "DISCONNECTED", call a 
> {{disconnected}} callback and use its endpoint detector to reconnect.
> This doesn't work in a testing environment where a 
> {{ConstantEndpointDetector}} is used. While the resource provider is notified 
> of the closed HTTP connection (and logs {{End-Of-File received}}), it never 
> disconnects itself or calls the {{disconnected}} callback. Discarding 
> {{HttpConnectionProcess::detection}} in 
> {{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} 
> callback of that future. This might not be a problem in 
> {{HttpConnectionProcess}} but could be related to the test case using a 
> {{ConstantEndpointDetector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8267) NestedMesosContainerizerTest.ROOT_CGROUPS_RecoverLauncherOrphans is flaky.

2017-12-20 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297244#comment-16297244
 ] 

Alexander Rukletsov edited comment on MESOS-8267 at 12/20/17 2:03 PM:
--

{noformat}
Commit: 0cc636b2d5ad5c934b8b7f350bc8c99b9282b5ab [0cc636b]
Author: Andrei Budnik 
Date: 19 December 2017 at 19:45:33 GMT+1
Committer: Alexander Rukletsov 

Fixed flaky `ROOT_CGROUPS_RecoverLauncherOrphans` test.

Containerizer recovery returns control to the caller before completion
of destruction of orphaned containers. Previously, `wait` was called on
a container right after calling `recover`, so `wait` was almost always
successful, because destruction of the orphaned container takes some
time to complete.

This patch replaces check for the container existence with the check
that a related freezer cgroup has been destroyed. The freezer cgroup
is destroyed during container destruction initiated by a containerizer
recovery process.

Review: https://reviews.apache.org/r/64680/
{noformat}
{noformat}
Commit: 196fe20861a4efc8edcabb08ce8821e8ed1b8f02 [196fe20]
Author: Andrei Budnik 
Date: 20 December 2017 at 14:59:50 GMT+1
Committer: Alexander Rukletsov 

Fixed flaky `NestedMesosContainerizerTest` tests.

This patch is an addition to commit 0cc636b2d5.

Review: https://reviews.apache.org/r/64749/
{noformat}


was (Author: alexr):
{noformat}
Commit: 0cc636b2d5ad5c934b8b7f350bc8c99b9282b5ab [0cc636b]
Author: Andrei Budnik 
Date: 19 December 2017 at 19:45:33 GMT+1
Committer: Alexander Rukletsov 

Fixed flaky `ROOT_CGROUPS_RecoverLauncherOrphans` test.

Containerizer recovery returns control to the caller before completion
of destruction of orphaned containers. Previously, `wait` was called on
a container right after calling `recover`, so `wait` was almost always
successful, because destruction of the orphaned container takes some
time to complete.

This patch replaces check for the container existence with the check
that a related freezer cgroup has been destroyed. The freezer cgroup
is destroyed during container destruction initiated by a containerizer
recovery process.

Review: https://reviews.apache.org/r/64680/
{noformat}

> NestedMesosContainerizerTest.ROOT_CGROUPS_RecoverLauncherOrphans is flaky.
> --
>
> Key: MESOS-8267
> URL: https://issues.apache.org/jira/browse/MESOS-8267
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Assignee: Andrei Budnik
>  Labels: flaky-test
> Fix For: 1.5.0
>
> Attachments: ROOT_CGROUPS_RecoverLauncherOrphans-badrun.txt
>
>
> {noformat}
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1902
> wait.get() is NONE
> {noformat}
> Full log attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8349) When a resource provider driver is disconnected, it fails to reconnect.

2017-12-20 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8349:
---

 Summary: When a resource provider driver is disconnected, it fails 
to reconnect.
 Key: MESOS-8349
 URL: https://issues.apache.org/jira/browse/MESOS-8349
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Jan Schlicht
Assignee: Jan Schlicht


If the resource provider manager closes the HTTP connection of a resource 
provider, the resource provider should reconnect itself. For that, the resource 
provider driver will change its state to "DISCONNECTED", call a 
{{disconnected}} callback and use its endpoint detector to reconnect.
This doesn't work in a testing environment where a {{ConstantEndpointDetector}} 
is used. While the resource provider is notified of the closed HTTP connection 
(and logs {{End-Of-File received}}), it never disconnects itself or calls the 
{{disconnected}} callback. Discarding {{HttpConnectionProcess::detection}} in 
{{HttpConnectionProcess::disconnected}} doesn't trigger the {{onAny}} callback 
of that future. This might not be a problem in {{HttpConnectionProcess}} but 
could be related to the test case using a {{ConstantEndpointDetector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8335) ProvisionerDockerTest fails on Debian 9

2017-12-20 Thread Armand Grillet (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armand Grillet updated MESOS-8335:
--
Description: 
Version of Docker used: Docker version 17.11.0-ce, build 1caf76c
Version of Curl used: curl 7.52.1 (x86_64-pc-linux-gnu) libcurl/7.52.1 
OpenSSL/1.0.2l zlib/1.2.8 libidn2/0.16 libpsl/0.17.0 (+libidn2/0.16) 
libssh2/1.7.0 nghttp2/1.18.1 librtmp/2.3

Error:
{code}
[ RUN  ] 
ImageAlpine/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2
I1215 00:09:28.694677 19343 cluster.cpp:172] Creating default 'local' authorizer
I1215 00:09:28.697144 30867 master.cpp:456] Master 
75b48a47-7b6b-4e60-82d3-dfdc0cf8bff3 (ip-172-16-10-160.ec2.internal) started on 
127.0.1.1:35029
I1215 00:09:28.697163 30867 master.cpp:458] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/4RYdF1/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/4RYdF1/master" 
--zk_session_timeout="10secs"
I1215 00:09:28.697413 30867 master.cpp:507] Master only allowing authenticated 
frameworks to register
I1215 00:09:28.697422 30867 master.cpp:513] Master only allowing authenticated 
agents to register
I1215 00:09:28.697427 30867 master.cpp:519] Master only allowing authenticated 
HTTP frameworks to register
I1215 00:09:28.697433 30867 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/4RYdF1/credentials'
I1215 00:09:28.697654 30867 master.cpp:563] Using default 'crammd5' 
authenticator
I1215 00:09:28.697806 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I1215 00:09:28.697962 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I1215 00:09:28.698076 30867 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I1215 00:09:28.698194 30867 master.cpp:642] Authorization enabled
I1215 00:09:28.698468 30864 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I1215 00:09:28.698563 30864 whitelist_watcher.cpp:77] No whitelist given
I1215 00:09:28.701695 30871 master.cpp:2209] Elected as the leading master!
I1215 00:09:28.701723 30871 master.cpp:1689] Recovering from registrar
I1215 00:09:28.701859 30869 registrar.cpp:347] Recovering registrar
I1215 00:09:28.702401 30869 registrar.cpp:391] Successfully fetched the 
registry (0B) in 507904ns
I1215 00:09:28.702495 30869 registrar.cpp:495] Applied 1 operations in 28977ns; 
attempting to update the registry
I1215 00:09:28.702997 30869 registrar.cpp:552] Successfully updated the 
registry in 464896ns
I1215 00:09:28.703086 30869 registrar.cpp:424] Successfully recovered registrar
I1215 00:09:28.703640 30865 master.cpp:1802] Recovered 0 agents from the 
registry (167B); allowing 10mins for agents to re-register
I1215 00:09:28.703661 30869 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W1215 00:09:28.706816 19343 process.cpp:2756] Attempted to spawn already 
running process files@127.0.1.1:35029
I1215 00:09:28.707818 19343 containerizer.cpp:304] Using isolation { 
environment_secret, volume/sandbox_path, volume/host_path, docker/runtime, 
filesystem/linux, network/cni, volume/image }
I1215 00:09:28.712363 19343 linux_launcher.cpp:146] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
sh: 1: hadoop: not found
I1215 00:09:28.740589 19343 fetcher.cpp:69] Skipping URI fetcher plugin 
'hadoop' as it could not be created: Failed to create HDFS client: Hadoop 
client is not available, exit status: 32512
I1215 00:09:28.740788 19343 registry_puller.cpp:129] Creating registry puller 
with docker registry 'https://registry-1.docker.io'
I1215 00:09:28.742266 19343 provisioner.cpp:299] Using default backend 'overlay'
I1215 00:09:28.745649 

[jira] [Commented] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed

2017-12-20 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298138#comment-16298138
 ] 

Jan Schlicht commented on MESOS-8346:
-

It will land today; the patch seems to be good and just needs a small update.

> Resubscription of a resource provider will crash the agent if its HTTP 
> connection isn't closed
> --
>
> Key: MESOS-8346
> URL: https://issues.apache.org/jira/browse/MESOS-8346
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Blocker
>  Labels: mesosphere
>
> A resource provider might resubscribe while its old HTTP connection wasn't 
> properly closed. In that case an agent will crash with, e.g., the following 
> log:
> {noformat}
> I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource 
> provider 
> {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
> I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider 
> message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
> I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: 
> resourceProviders.subscribed.contains(resourceProviderId) 
> *** Check failure stack trace: ***
> E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> @0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
> @0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
> I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 61830ns
> I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
> I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13) disconnected
> I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
> @0x115f2761d  
> mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
> @0x115f2977d  
> _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
> @0x115f29740  
> _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_
> @0x115f296bb  
> _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_
> @0x115f2965d  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
> @0x115f29631  
> 

[jira] [Commented] (MESOS-6623) Re-enable tests impacted by request streaming support

2017-12-20 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298114#comment-16298114
 ] 

Gilbert Song commented on MESOS-6623:
-

Seems like this is targeted for 1.5.0. Do we have an estimate when it will land?

/cc [~anandmazumdar]

> Re-enable tests impacted by request streaming support
> -
>
> Key: MESOS-6623
> URL: https://issues.apache.org/jira/browse/MESOS-6623
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, test
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> We added support for HTTP request streaming in libprocess as part of 
> MESOS-6466. However, this broke a few tests that relied on HTTP request 
> filtering since the handlers no longer have access to the body of the request 
> when {{visit()}} is invoked. We would need to revisit how we do HTTP request 
> filtering and then re-enable these tests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8346) Resubscription of a resource provider will crash the agent if its HTTP connection isn't closed

2017-12-20 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298109#comment-16298109
 ] 

Gilbert Song commented on MESOS-8346:
-

[~nfnt], seems like this is targeted for 1.5.0. Do we have an estimate when it 
will land?

> Resubscription of a resource provider will crash the agent if its HTTP 
> connection isn't closed
> --
>
> Key: MESOS-8346
> URL: https://issues.apache.org/jira/browse/MESOS-8346
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Jan Schlicht
>Assignee: Jan Schlicht
>Priority: Blocker
>  Labels: mesosphere
>
> A resource provider might resubscribe while its old HTTP connection wasn't 
> properly closed. In that case an agent will crash with, e.g., the following 
> log:
> {noformat}
> I1219 13:33:51.937295 128610304 manager.cpp:570] Subscribing resource 
> provider 
> {"id":{"value":"8e71beef-796e-4bde-9257-952ed0f230a5"},"name":"test","type":"org.apache.mesos.rp.test"}
> I1219 13:33:51.937443 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> I1219 13:33:51.937760 128610304 manager.cpp:134] Terminating resource 
> provider 8e71beef-796e-4bde-9257-952ed0f230a5
> E1219 13:33:51.937851 129683456 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.937865 131293184 slave.cpp:7105] Handling resource provider 
> message 'DISCONNECT: resource provider 8e71beef-796e-4bde-9257-952ed0f230a5'
> I1219 13:33:51.937968 131293184 slave.cpp:7347] Forwarding new total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> F1219 13:33:51.938052 132366336 manager.cpp:606] Check failed: 
> resourceProviders.subscribed.contains(resourceProviderId) 
> *** Check failure stack trace: ***
> E1219 13:33:51.938583 130756608 http_connection.hpp:445] End-Of-File received
> I1219 13:33:51.938987 129683456 hierarchical.cpp:669] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 (172.18.8.13) updated with total 
> resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> @0x1125380ef  google::LogMessageFatal::~LogMessageFatal()
> @0x112534ae9  google::LogMessageFatal::~LogMessageFatal()
> I1219 13:33:51.939131 129683456 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 61830ns
> I1219 13:33:51.945793 2646795072 slave.cpp:927] Agent terminating
> I1219 13:33:51.945955 129146880 master.cpp:1305] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13) disconnected
> I1219 13:33:51.945979 129146880 master.cpp:3364] Disconnecting agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946022 129146880 master.cpp:3383] Deactivating agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 at slave(1)@172.18.8.13:64430 
> (172.18.8.13)
> I1219 13:33:51.946081 131293184 hierarchical.cpp:766] Agent 
> 0019c3fa-28c5-43a9-88d0-709eee271c62-S0 deactivated
> @0x115f2761d  
> mesos::internal::ResourceProviderManagerProcess::subscribe()::$_2::operator()()
> @0x115f2977d  
> _ZN5cpp176invokeIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS2_14HttpConnectionERKNS1_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSG_DpOSH_
> @0x115f29740  
> _ZN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7Nothing13invoke_expandISC_NSt3__15tupleIJSG_EEENSK_IJEEEJLm0DTclsr5cpp17E6invokeclsr3stdE7forwardIT_Efp_Espcl6expandclsr3stdE3getIXT2_EEclsr3stdE7forwardIT0_Efp0_EEclsr3stdE7forwardIT1_Efp2_OSN_OSO_N5cpp1416integer_sequenceImJXspT2_OSP_
> @0x115f296bb  
> _ZNO6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS3_14HttpConnectionERKNS2_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingclIJEEEDTcl13invoke_expandclL_ZNSt3__14moveIRSC_EEONSJ_16remove_referenceIT_E4typeEOSN_EdtdefpT1fEclL_ZNSK_IRNSJ_5tupleIJSG_ESQ_SR_EdtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0_Eclsr3stdE16forward_as_tuplespclsr3stdE7forwardIT_Efp_DpOSY_
> @0x115f2965d  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN5mesos8internal30ResourceProviderManagerProcess9subscribeERKNS5_14HttpConnectionERKNS4_17resource_provider14Call_SubscribeEE3$_2JN7process6FutureI7NothingEJEEEDTclclsr3stdE7forwardIT_Efp_Espclsr3stdE7forwardIT0_Efp0_EEEOSK_DpOSL_
> @0x115f29631  
> 

[jira] [Commented] (MESOS-8297) Built-in driver-based executors ignore kill task if the task has not been launched.

2017-12-20 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298108#comment-16298108
 ] 

Gilbert Song commented on MESOS-8297:
-

Seems like this is targeted for 1.5.0. Do we have an estimate when it will land?

/cc [~alexr]

> Built-in driver-based executors ignore kill task if the task has not been 
> launched.
> ---
>
> Key: MESOS-8297
> URL: https://issues.apache.org/jira/browse/MESOS-8297
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>Priority: Blocker
>  Labels: mesosphere
>
> If the docker executor receives a kill task request and the task has never 
> been launched, the request is ignored. We now know that the executor never 
> received the registration confirmation, hence ignored the launch task 
> request, hence the task never started. This is how the executor 
> enters an idle state, waiting for registration and ignoring kill task 
> requests.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)