[jira] [Comment Edited] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.
[ https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504388#comment-15504388 ] Stéphane Cottin edited comment on MESOS-6002 at 9/20/16 6:19 AM: - Same issue using overlayfs : {code} Failed to remove whiteout file '/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq': No such file or directory {code} Can be reproduced with official postgres, rabbitmq and many more docker images, all deleting the same folder in multiple RUN calls. update: forgot to mention, patched with https://reviews.apache.org/r/51124/ was (Author: kaalh): Same issue using overlayfs : {code} Failed to remove whiteout file '/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq': No such file or directory {code} Can be reproduced with official postgres, rabbitmq and many more docker images, all deleting the same folder in multiple RUN calls. > The whiteout file cannot be removed correctly using aufs backend. > - > > Key: MESOS-6002 > URL: https://issues.apache.org/jira/browse/MESOS-6002 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Ubuntu 14, Ubuntu 12 > Or any os with aufs module >Reporter: Gilbert Song > Labels: aufs, backend, containerizer > > The whiteout file is not removed correctly when using the aufs backend in > unified containerizer. It can be verified by this unit test with the aufs > manually specified. 
> {noformat} > [20:11:24] : [Step 10/10] [ RUN ] > ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout > [20:11:24]W: [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] > Creating default 'local' authorizer > [20:11:25]W: [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] > Opened db in 14.308627ms > [20:11:25]W: [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] > Compacted db in 2.558329ms > [20:11:25]W: [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] > Created db iterator in 3086ns > [20:11:25]W: [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] > Seeked to beginning of db in 595ns > [20:11:25]W: [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] > Iterated through 0 keys in the db in 314ns > [20:11:25]W: [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [20:11:25]W: [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] > Starting replica recovery > [20:11:25]W: [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] > Replica is in EMPTY status > [20:11:25]W: [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > __req_res__(5640)@172.30.2.105:36006 > [20:11:25]W: [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [20:11:25]W: [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] > Updating replica status to STARTING > [20:11:25]W: [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] > Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) > started on 172.30.2.105:36006 > [20:11:25]W: [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > 
--authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" > --registry_strict="true" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs" > [20:11:25]W: [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] > Master only allowing auth
[jira] [Comment Edited] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505567#comment-15505567 ] kasim edited comment on MESOS-6205 at 9/20/16 4:53 AM: --- Thanks, emptying work_dir works. But I don't understand how this situation happened. At first, I started only one master and zookeeper as a test. {code} $ cat /etc/mesos/zk zk://10.142.55.190:2181/mesos {code} The slave on the same machine was able to connect to the master, but the others couldn't. So I tried to start three mesos-masters and zookeepers to form a cluster, and changed `/etc/mesos/zk` to {code} zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos {code}, then got the above error. Does this mean I need to clear work_dir every time I add a new mesos-master? was (Author: mithril): Thanks, empty work_dir works. But I don't understand how this situation happen. At first, I started only one master and zookeeper for test. {code} $ cat /etc/mesos/zk zk://10.142.55.190:2181/mesos {code} The slave on same machine was able to connect master, but other couldn't. So I tried to start three master to consist cluster, change `/etc/mesos/zk` to {code} zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos {code}, then got above error. Is this mean I need clear wrok_dir everytime when adding a new mesos-master? > mesos-master can not found mesos-slave, and elect a new leader in a short > interval > -- > > Key: MESOS-6205 > URL: https://issues.apache.org/jira/browse/MESOS-6205 > Project: Mesos > Issue Type: Bug > Components: master > Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64 >Reporter: kasim > > I followed this > [doc|https://open.mesosphere.com/getting-started/install/#verifying-installation] > to set up a mesos cluster. > There are three VMs (ubuntu 12, centos 6.5, centos 7.2). 
> {code} > $ cat /etc/hosts > 10.142.55.190 zk1 > 10.142.55.196 zk2 > 10.142.55.202 zk3 > {code} > config on each machine: > {code} > $ cat /etc/mesos/zk > zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos > {code} > > After starting zookeeper, mesos-master and mesos-slave on the three VMs, I can view > the mesos webui (10.142.55.190:5050), but the agent count is 0. > After a little while, the mesos page gets an error: > {code} > Failed to connect to 10.142.55.190:5050! > Retrying in 16 seconds... > {code} > (I found that zookeeper would elect a new leader in a short interval) > > mesos-master cmd: > {code} > mesos-master --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="false" > --authenticate_frameworks="false" --authenticate_http_frameworks="false" > --authenticate_http_readonly="false" --authenticate_http_readwrite="false" > --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --ip="10.142.55.190" > --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --port="5050" --quiet="false" --quorum="2" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/share/mesos/webui" > --work_dir="/var/lib/mesos" > --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" > {code} > mesos-slave cmd: > {code} > mesos-slave --appc_simple_discovery_uri_prefix="http://"; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" 
--authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" > --docker="docker" --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/tmp/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="1mins" > --ex
[jira] [Updated] (MESOS-6210) Master redirect with suffix gets in redirect loop
[ https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-6210: -- Labels: newbie (was: ) I think the bug is here: https://github.com/apache/mesos/blob/master/src/master/http.cpp#L2036 We need to do a "contains" check instead of an "==" check. cc [~haosd...@gmail.com] [~drcrallen] Would you like to send a patch for this? > Master redirect with suffix gets in redirect loop > - > > Key: MESOS-6210 > URL: https://issues.apache.org/jira/browse/MESOS-6210 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Reporter: Charles Allen > Labels: newbie > > Trying to go to a URI like > {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a > redirect loop. > The expected behavior is to either not support anything after {{redirect}} in > the path (redirect must be handled by a smart client), or to redirect to the > suffix (redirect can be handled by a dumb client). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path
[ https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505148#comment-15505148 ] Vinod Kone commented on MESOS-6207: --- [~tillt] is this something you are interested in shepherding? > Python bindings fail to build with custom SVN installation path > --- > > Key: MESOS-6207 > URL: https://issues.apache.org/jira/browse/MESOS-6207 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.0.1 >Reporter: Ilya Pronin >Priority: Trivial > > In {{src/Makefile.am}}, the {{PYTHON_LDFLAGS}} variable is used while building > the Python bindings. This variable picks up {{LDFLAGS}} during the configuration > phase, before we check for a custom SVN installation path, and so misses the > {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with an > uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4668) Agent's /state endpoint does not include full reservation information
[ https://issues.apache.org/jira/browse/MESOS-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504971#comment-15504971 ] Yan Xu commented on MESOS-4668: --- As a follow up: MESOS-6211 > Agent's /state endpoint does not include full reservation information > - > > Key: MESOS-4668 > URL: https://issues.apache.org/jira/browse/MESOS-4668 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Neil Conway >Assignee: Yan Xu >Priority: Minor > Labels: endpoint, reservations > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6211) Add total resources to the agent operator API.
Yan Xu created MESOS-6211: - Summary: Add total resources to the agent operator API. Key: MESOS-6211 URL: https://issues.apache.org/jira/browse/MESOS-6211 Project: Mesos Issue Type: Improvement Reporter: Yan Xu Looks like it can be a field in {code} message GetAgent { optional AgentInfo agent_info = 1; + repeated Resource total_resources = 2; } {code} The point is to include reservations. The name is consistent with [master.proto|https://github.com/apache/mesos/blob/29236068f23b6cfbd19d7a4b5b96be852818a356/include/mesos/v1/master/master.proto#L305]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6062) mesos-agent should autodetect mount-type volume sizes
[ https://issues.apache.org/jira/browse/MESOS-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489143#comment-15489143 ] Anindya Sinha edited comment on MESOS-6062 at 9/19/16 10:45 PM: RRs published for review: https://reviews.apache.org/r/51999 https://reviews.apache.org/r/52001 https://reviews.apache.org/r/52002 https://reviews.apache.org/r/51879 https://reviews.apache.org/r/51880 https://reviews.apache.org/r/52071/ was (Author: anindya.sinha): RRs published for review: https://reviews.apache.org/r/51999 https://reviews.apache.org/r/52001 https://reviews.apache.org/r/52002 https://reviews.apache.org/r/51879 https://reviews.apache.org/r/51880 > mesos-agent should autodetect mount-type volume sizes > - > > Key: MESOS-6062 > URL: https://issues.apache.org/jira/browse/MESOS-6062 > Project: Mesos > Issue Type: Improvement > Components: slave >Reporter: Yan Xu >Assignee: Anindya Sinha > > When dealing with a large fleet of machines, it can be cumbersome to > construct a resources JSON file that varies from host to host. Mesos > already auto-detects resources such as cpus, mem, and the "root" disk; it > should extend this to MOUNT-type disks, since the value should clearly be > the size of the entire volume. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4668) Agent's /state endpoint does not include full reservation information
[ https://issues.apache.org/jira/browse/MESOS-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491011#comment-15491011 ] Yan Xu edited comment on MESOS-4668 at 9/19/16 10:35 PM: - https://reviews.apache.org/r/51866 https://reviews.apache.org/r/51867 https://reviews.apache.org/r/51868 https://reviews.apache.org/r/52070 The test (shared with MESOS-6085) https://reviews.apache.org/r/51870 was (Author: xujyan): https://reviews.apache.org/r/51866 https://reviews.apache.org/r/51867 https://reviews.apache.org/r/51868 The test (shared with MESOS-6085) https://reviews.apache.org/r/51870 > Agent's /state endpoint does not include full reservation information > - > > Key: MESOS-4668 > URL: https://issues.apache.org/jira/browse/MESOS-4668 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Neil Conway >Assignee: Yan Xu >Priority: Minor > Labels: endpoint, reservations > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5821) Clean up the billions of compiler warnings on MSVC
[ https://issues.apache.org/jira/browse/MESOS-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504857#comment-15504857 ] Joseph Wu commented on MESOS-5821: -- {code} commit 4a91c1ed8c86ce2100dba24b321fc578ad3d0473 Author: Daniel Pravat Date: Mon Sep 19 14:21:10 2016 -0700 Windows: Fixed warnings in `duration.hpp`. Captures the result of `std::ostream::precision` via its return type `std::streamsize` rather than implicitly casting to `long`. Review: https://reviews.apache.org/r/52061/ {code} {code} commit a5993cec547f299a8edff7662599c68e208ecdac Author: Daniel Pravat Date: Mon Sep 19 14:51:12 2016 -0700 Windows: Fixed warnings in `windows/os.hpp`. Captured the result of `Process32First` via its return type `BOOL` rather than implicitly casting to `bool`. Review: https://reviews.apache.org/r/52063/ {code} > Clean up the billions of compiler warnings on MSVC > -- > > Key: MESOS-5821 > URL: https://issues.apache.org/jira/browse/MESOS-5821 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Alex Clemmer >Assignee: Daniel Pravat > Labels: mesosphere, slave > > Clean builds of Mesos on Windows will result in approximately {{5800 > Warning(s)}} or more. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF
[ https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-5986: - Fix Version/s: 0.27.4 0.28.3 > SSL Socket CHECK can fail after socket receives EOF > --- > > Key: MESOS-5986 > URL: https://issues.apache.org/jira/browse/MESOS-5986 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.0.0 >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Blocker > Labels: mesosphere > Fix For: 0.28.3, 1.0.1, 0.27.4 > > > While writing a test for MESOS-3753, I encountered a bug where [this > check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708] > fails at the very end of the test body, while objects in the stack frame are > being destroyed. After adding some debug logging output, I produced the > following: > {code} > I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17 > I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up > __limiter__(3)@127.0.0.1:55688 > I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in > initialize(): 14 > I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming > (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00 > I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14 > I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent > e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated > I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming > help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00 > I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in > event_callback(bev) > I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in > event_callback check for EOF/CONNECTED/ERROR: 19 > I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in > shutdown(): 19 > I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to > 
(87)@127.0.0.1:55688 while waiting > I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming > __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00 > I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming > (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00 > I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up > (87)@127.0.0.1:55688 > I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in > event_callback(bev) > I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up > __http__(12)@127.0.0.1:55688 > I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in > event_callback check for EOF/CONNECTED/ERROR: 17 > I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in > shutdown(): 17 > I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming > help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00 > I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming > __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00 > I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17 > I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming > __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00 > I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming > status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 > 15:32:33.264056064+00:00 > I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up > __http__(11)@127.0.0.1:55688 > I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up > status-update-manager(3)@127.0.0.1:55688 > I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on > socket: 17, data: 0 > I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming > (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00 > I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming > help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00 > I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up > 
(89)@127.0.0.1:55688 > I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in > send_callback(bev) > I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to > (86)@127.0.0.1:55688 while waiting > I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming > (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00 > I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in > send_callback(): 17 > I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up > (76)@127.0.0.1:55688 > I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming > (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00 > I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up > (86)@127.0.0.1:55688 > I0804 08:32:33.
[jira] [Updated] (MESOS-6152) Resource leak in libevent_ssl_socket.cpp.
[ https://issues.apache.org/jira/browse/MESOS-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-6152: - Fix Version/s: 1.0.2 0.27.4 0.28.3 > Resource leak in libevent_ssl_socket.cpp. > - > > Key: MESOS-6152 > URL: https://issues.apache.org/jira/browse/MESOS-6152 > Project: Mesos > Issue Type: Bug >Reporter: Joerg Schad >Assignee: Benjamin Bannier > Labels: coverity > Fix For: 0.28.3, 0.27.4, 1.1.0, 1.0.2 > > > Coverity detected the following resource leak. > IMO {code} if (fd == -1) {code} should be {code} if (owned_fd == -1) {code}. > {code} > // Duplicate the file descriptor because Libevent will take ownership > 754 // and control the lifecycle separately. > 755 // > 756 // TODO(josephw): We can avoid duplicating the file descriptor in > 757 // future versions of Libevent. In Libevent versions 2.1.2 and later, > 758 // we may use `evbuffer_file_segment_new` and `evbuffer_add_file_segment` > 759 // instead of `evbuffer_add_file`. > 3. open_fn: Returning handle opened by dup. > 4. var_assign: Assigning: owned_fd = handle returned from dup(fd). > 760 int owned_fd = dup(fd); > CID 1372873: Argument cannot be negative (REVERSE_NEGATIVE) [select > issue] > 5. Condition fd == -1, taking true branch. > 761 if (fd == -1) { > > CID 1372872 (#1 of 1): Resource leak (RESOURCE_LEAK) > 6. leaked_handle: Handle variable owned_fd going out of scope leaks the > handle. > 762return Failure(ErrnoError("Failed to duplicate file descriptor")); > 763 } > {code} > https://scan5.coverity.com/reports.htm#v39597/p10429/fileInstanceId=98881747&defectInstanceId=28450468 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6104) Potential FD double close in libevent's implementation of `sendfile`.
[ https://issues.apache.org/jira/browse/MESOS-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-6104: - Fix Version/s: 1.0.2 0.27.4 0.28.3 > Potential FD double close in libevent's implementation of `sendfile`. > - > > Key: MESOS-6104 > URL: https://issues.apache.org/jira/browse/MESOS-6104 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 0.27.3, 0.28.2, 1.0.1 >Reporter: Joseph Wu >Assignee: Joseph Wu >Priority: Critical > Labels: mesosphere, ssl > Fix For: 0.28.3, 0.27.4, 1.1.0, 1.0.2 > > > Repro copied from: https://reviews.apache.org/r/51509/ > It is possible to make the master CHECK fail by repeatedly hitting the web UI > and reloading the static assets: > 1) Paste a large amount of text (16 KB or more) into > `src/webui/master/static/home.html`. The more text, the more reliable the > repro. > 2) Start the master with SSL enabled: > {code} > LIBPROCESS_SSL_ENABLED=true LIBPROCESS_SSL_KEY_FILE=key.pem > LIBPROCESS_SSL_CERT_FILE=cert.pem bin/mesos-master.sh --work_dir=/tmp/master > {code} > 3) Run two instances of this python script repeatedly: > {code} > import socket > import ssl > s = ssl.wrap_socket(socket.socket()) > s.connect(("localhost", 5050)) > s.sendall("""GET /static/home.html HTTP/1.1 > User-Agent: foobar > Host: localhost:5050 > Accept: */* > Connection: Keep-Alive > """) > # The HTTP part of the response > print s.recv(1000) > {code} > i.e. > {code} > while python test.py; do :; done & while python test.py; do :; done > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
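The hazard behind this ticket, closing a descriptor again after its ownership has already been handed off (as libevent's `evbuffer_add_file` does), can be demonstrated generically. This is a standalone sketch, not Mesos code, and `doubleCloseFails` is a hypothetical helper:

```cpp
#include <unistd.h>

#include <cerrno>

// Standalone demonstration of the double-close hazard (not Mesos code).
// The first close() succeeds; a second close() of the same number fails
// with EBADF, or, in a concurrent program, may silently close an unrelated
// descriptor that happened to reuse that number in the meantime.
bool doubleCloseFails(int fd) {
  if (close(fd) != 0) {
    return false;  // the first close of a valid fd should succeed
  }
  return close(fd) == -1 && errno == EBADF;  // the second close is an error
}
```

The "silently closes an unrelated descriptor" case is the dangerous one in a busy master, since descriptor numbers are recycled aggressively under load like the repro above.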
[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems
[ https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504625#comment-15504625 ] Mao Geng commented on MESOS-5909: - Got it. Thanks! Look forward to addressing review comments. > Stout "OsTest.User" test can fail on some systems > - > > Key: MESOS-5909 > URL: https://issues.apache.org/jira/browse/MESOS-5909 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Kapil Arya >Assignee: Gilbert Song > Labels: mesosphere > Attachments: MESOS-5909-fix.diff > > > Libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted manner > (in my case, it's returning "471 100") ... whereas {{id -G}} return a sorted > list ("100 471" in my case) causing the validation inside the loop to fail. > We should sort both lists before comparing the values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
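The fix described in the ticket (sort both lists before comparing) can be sketched as follows. This is an illustrative sketch, not the attached patch; `sameGroups` is a hypothetical helper:

```cpp
#include <sys/types.h>

#include <algorithm>
#include <vector>

// Sketch of the proposed fix: getgrouplist(3) and `id -G` may report the
// same supplementary gids in different orders (e.g. "471 100" vs "100 471"),
// so both lists are sorted before the element-wise comparison. The vectors
// are taken by value so the caller's copies are left untouched.
bool sameGroups(std::vector<gid_t> fromGetgrouplist,
                std::vector<gid_t> fromIdG) {
  std::sort(fromGetgrouplist.begin(), fromGetgrouplist.end());
  std::sort(fromIdG.begin(), fromIdG.end());
  return fromGetgrouplist == fromIdG;
}
```

With the ticket's own example, `sameGroups({471, 100}, {100, 471})` holds, whereas a positional comparison of the raw lists would fail.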
[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15497951#comment-15497951 ] Greg Mann edited comment on MESOS-6180 at 9/19/16 8:38 PM: --- Thanks for the patch to address the mount leak [~jieyu]! (https://reviews.apache.org/r/51963/) I ran {{sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 GTEST_FILTER="\*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace\*" bin/mesos-tests.sh}} and stressed my machine with {{stress -c N -i N -m N -d 1}}, where {{N}} is number of cores, and I was able to reproduce a couple of these offer future timeout failures after a few tens of repetitions. I attached logs above as {{flaky-containerizer-pid-namespace-forward.txt}} and {{flaky-containerizer-pid-namespace-backward.txt}}. We can see the master beginning agent registration, but we never see the line {{Registered agent ...}} from {{Master::_registerSlave()}}, which indicates that registration is complete and the registered message has been sent to the agent: {code} I0917 01:35:17.184216 480 master.cpp:4886] Registering agent at slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) with id fa7a42d0-5d0c-4799-b19f-2a85b43039f3-S0 I0917 01:35:17.184232 474 process.cpp:2707] Resuming __reaper__(1)@172.31.1.104:57341 at 2016-09-17 01:35:17.184222976+00:00 I0917 01:35:17.184377 474 process.cpp:2707] Resuming registrar(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184371968+00:00 I0917 01:35:17.184554 474 registrar.cpp:464] Applied 1 operations in 79217ns; attempting to update the registry I0917 01:35:17.184953 474 process.cpp:2697] Spawned process __latch__(141)@172.31.1.104:57341 I0917 01:35:17.184990 485 process.cpp:2707] Resuming log-storage(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184982016+00:00 I0917 01:35:17.185561 485 process.cpp:2707] Resuming log-writer(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185552896+00:00 I0917 01:35:17.185609 
485 log.cpp:577] Attempting to append 434 bytes to the log I0917 01:35:17.185804 485 process.cpp:2707] Resuming log-coordinator(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185797888+00:00 I0917 01:35:17.185863 485 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 3 I0917 01:35:17.185998 485 process.cpp:2697] Spawned process log-write(29)@172.31.1.104:57341 I0917 01:35:17.186030 475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186021888+00:00 I0917 01:35:17.186189 475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186182912+00:00 I0917 01:35:17.186275 475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186267904+00:00 I0917 01:35:17.186424 475 process.cpp:2707] Resuming log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186416896+00:00 I0917 01:35:17.186575 475 process.cpp:2697] Spawned process __req_res__(55)@172.31.1.104:57341 I0917 01:35:17.186724 475 process.cpp:2707] Resuming log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186717952+00:00 I0917 01:35:17.186609 485 process.cpp:2707] Resuming __req_res__(55)@172.31.1.104:57341 at 2016-09-17 01:35:17.186601984+00:00 I0917 01:35:17.186898 485 process.cpp:2707] Resuming log-replica(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186892032+00:00 I0917 01:35:17.186962 485 replica.cpp:537] Replica received write request for position 3 from __req_res__(55)@172.31.1.104:57341 I0917 01:35:17.185014 471 process.cpp:2707] Resuming __gc__@172.31.1.104:57341 at 2016-09-17 01:35:17.185008896+00:00 I0917 01:35:17.185036 480 process.cpp:2707] Resuming __latch__(141)@172.31.1.104:57341 at 2016-09-17 01:35:17.185029120+00:00 I0917 01:35:17.196358 482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.196335104+00:00 I0917 01:35:17.196900 482 slave.cpp:1471] Will retry registration in 25.224033ms if necessary I0917 01:35:17.197029 482 process.cpp:2707] 
Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.197024000+00:00 I0917 01:35:17.197157 482 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress I0917 01:35:17.224309 482 process.cpp:2707] Resuming slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.224284928+00:00 I0917 01:35:17.224845 482 slave.cpp:1471] Will retry registration in 63.510932ms if necessary I0917 01:35:17.224900 475 process.cpp:2707] Resuming master@172.31.1.104:57341 at 2016-09-17 01:35:17.224888064+00:00 I0917 01:35:17.225109 475 master.cpp:4874] Ignoring register agent message from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) as admission is already in progress {code}
[jira] [Created] (MESOS-6210) Master redirect with suffix gets in redirect loop
Charles Allen created MESOS-6210: Summary: Master redirect with suffix gets in redirect loop Key: MESOS-6210 URL: https://issues.apache.org/jira/browse/MESOS-6210 Project: Mesos Issue Type: Bug Components: HTTP API Reporter: Charles Allen Trying to go to a URI like {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a redirect loop. The expected behavior is to either not support anything after {{redirect}} in the path (redirect must be handled by a smart client), or to redirect to the suffix (redirect can be handled by a dumb client). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3011) Publish release documentation for major releases on website
[ https://issues.apache.org/jira/browse/MESOS-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504572#comment-15504572 ] Tim Anderegg commented on MESOS-3011: - Whoops, fixed :) Thanks! > Publish release documentation for major releases on website > --- > > Key: MESOS-3011 > URL: https://issues.apache.org/jira/browse/MESOS-3011 > Project: Mesos > Issue Type: Documentation > Components: documentation, project website >Reporter: Paul Brett >Assignee: Tim Anderegg > Labels: documentation, mesosphere > > Currently, the website only provides a single version of the documentation. > We should publish documentation for each release on the website independently > (for example as https://mesos.apache.org/documentation/0.22/index.html, > https://mesos.apache.org/documentation/0.23/index.html) and make latest > redirect to the current version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3011) Publish release documentation for major releases on website
[ https://issues.apache.org/jira/browse/MESOS-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504552#comment-15504552 ] Neil Conway commented on MESOS-3011: [~tanderegg] -- seems like the review request is private? > Publish release documentation for major releases on website > --- > > Key: MESOS-3011 > URL: https://issues.apache.org/jira/browse/MESOS-3011 > Project: Mesos > Issue Type: Documentation > Components: documentation, project website >Reporter: Paul Brett >Assignee: Tim Anderegg > Labels: documentation, mesosphere > > Currently, the website only provides a single version of the documentation. > We should publish documentation for each release on the website independently > (for example as https://mesos.apache.org/documentation/0.22/index.html, > https://mesos.apache.org/documentation/0.23/index.html) and make latest > redirect to the current version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3011) Publish release documentation for major releases on website
[ https://issues.apache.org/jira/browse/MESOS-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504544#comment-15504544 ] Tim Anderegg commented on MESOS-3011: - [~vinodkone] Sorry this took so long. If you remember, you offered to shepherd this change a while back; here's the review: https://reviews.apache.org/r/52064/ Let me know if you have any questions! > Publish release documentation for major releases on website > --- > > Key: MESOS-3011 > URL: https://issues.apache.org/jira/browse/MESOS-3011 > Project: Mesos > Issue Type: Documentation > Components: documentation, project website >Reporter: Paul Brett >Assignee: Tim Anderegg > Labels: documentation, mesosphere > > Currently, the website only provides a single version of the documentation. > We should publish documentation for each release on the website independently > (for example as https://mesos.apache.org/documentation/0.22/index.html, > https://mesos.apache.org/documentation/0.23/index.html) and make latest > redirect to the current version. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6130) Make the disk usage isolator nesting-aware
[ https://issues.apache.org/jira/browse/MESOS-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504499#comment-15504499 ] Greg Mann commented on MESOS-6130: -- Jie made these changes to the disk/du isolator: {code} Commit: a4fd86bce2d2b6b53bce2ea95e3b69c5ab011ff8 Parents: 8221021 Author: Jie Yu Authored: Sat Sep 17 12:26:52 2016 -0700 Committer: Jie Yu Committed: Sat Sep 17 13:02:37 2016 -0700 {code} Note that we still need to add tests for this, but we'll wait briefly for the rest of the containerizer changes to land before doing so. > Make the disk usage isolator nesting-aware > -- > > Key: MESOS-6130 > URL: https://issues.apache.org/jira/browse/MESOS-6130 > Project: Mesos > Issue Type: Task > Components: isolation >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > > With the addition of task groups, the disk usage isolator must be updated. > Since sub-container sandboxes are nested within the parent container's > sandbox, the isolator must exclude these folders from its usage calculation > when examining the parent container's disk usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
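The exclusion the ticket describes can be sketched abstractly. This is a hypothetical illustration, not the actual disk/du isolator code; `isNested` and `parentUsage` are invented helpers and the paths in the test are made up:

```cpp
#include <cstddef>

#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch (not the actual disk/du isolator): when computing a
// parent container's disk usage, skip any file that lives under one of the
// nested child sandboxes rooted inside the parent's sandbox directory.
bool isNested(const std::string& path,
              const std::vector<std::string>& childSandboxes) {
  for (const std::string& child : childSandboxes) {
    if (path.compare(0, child.size(), child) == 0) {
      return true;  // path is inside a nested child sandbox
    }
  }
  return false;
}

std::size_t parentUsage(
    const std::vector<std::pair<std::string, std::size_t>>& files,  // (path, bytes)
    const std::vector<std::string>& childSandboxes) {
  std::size_t total = 0;
  for (const auto& file : files) {
    if (!isNested(file.first, childSandboxes)) {
      total += file.second;  // only count the parent's own files
    }
  }
  return total;
}
```

The real isolator works on directory trees rather than flat path lists, but the accounting rule is the same: bytes under a child sandbox are charged to the child, not double-counted against the parent.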
[jira] [Commented] (MESOS-5856) Logrotate ContainerLogger module does not rotate logs when run as root with --switch_user
[ https://issues.apache.org/jira/browse/MESOS-5856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504438#comment-15504438 ] Joseph Wu commented on MESOS-5856: -- Added a unit test to assert the expected behavior: https://reviews.apache.org/r/52059/ > Logrotate ContainerLogger module does not rotate logs when run as root with > --switch_user > - > > Key: MESOS-5856 > URL: https://issues.apache.org/jira/browse/MESOS-5856 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.27.0, 0.28.0, 1.0.0 >Reporter: Joseph Wu >Assignee: Sivaram Kannan >Priority: Critical > Labels: logger, mesosphere, newbie > > The logrotate ContainerLogger module runs as the agent's user. In most > cases, this is {{root}}. > When {{logrotate}} is run as root, there is an additional check the > configuration files must pass (because a root {{logrotate}} needs to be > secured against non-root modifications to the configuration): > https://github.com/logrotate/logrotate/blob/fe80cb51a2571ca35b1a7c8ba0695db5a68feaba/config.c#L807-L815 > Log rotation will fail under the following scenario: > 1) The agent is run with {{--switch_user}} (default: true) > 2) A task is launched with a non-root user specified > 3) The logrotate module spawns a few companion processes (as root) and this > creates the {{stdout}}, {{stderr}}, {{stdout.logrotate.conf}}, and > {{stderr.logrotate.conf}} files (as root). This step races with the next > step. > 4) The Mesos containerizer and Fetcher will {{chown}} the task's sandbox to > the non-root user. Including the files just created. > 5) When {{logrotate}} is run, it will skip any non-root configuration files. > This means the files are not rotated. > > Fix: The logrotate module's companion processes should call {{setuid}} and > {{setgid}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
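The proposed fix (the companion processes calling {{setuid}} and {{setgid}}) can be sketched as follows. This is an illustrative sketch, not the module's actual code; note that `setgid` must be called before `setuid`, because a process that has already dropped its uid is no longer permitted to change its gid:

```cpp
#include <sys/types.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <string>

// Illustrative sketch of the proposed fix (not the actual module code):
// the logrotate companion process drops privileges to the task's user, so
// the *.logrotate.conf files it writes are owned by that user and pass
// root-logrotate's ownership check on the config files.
// Ordering matters: setgid() first, then setuid(); once the uid is
// dropped, changing the gid would fail with EPERM.
std::string switchUser(uid_t uid, gid_t gid) {
  if (setgid(gid) != 0) {
    return std::string("Failed to setgid: ") + std::strerror(errno);
  }
  if (setuid(uid) != 0) {
    return std::string("Failed to setuid: ") + std::strerror(errno);
  }
  return "";  // an empty string signals success
}
```

Dropping privileges up front also removes the race in step 3/4 above, since the conf files are then created with the task user's ownership from the start.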
[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems
[ https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504390#comment-15504390 ] Joseph Wu commented on MESOS-5909: -- Go ahead. All reviewers are specified by the person creating the ReviewBoard review. No need to ask permission, but it's often good to ping the people you've added via JIRA/email/Slack. > Stout "OsTest.User" test can fail on some systems > - > > Key: MESOS-5909 > URL: https://issues.apache.org/jira/browse/MESOS-5909 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Kapil Arya >Assignee: Gilbert Song > Labels: mesosphere > Attachments: MESOS-5909-fix.diff > > > Libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted manner > (in my case, it's returning "471 100") ... whereas {{id -G}} return a sorted > list ("100 471" in my case) causing the validation inside the loop to fail. > We should sort both lists before comparing the values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.
[ https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504388#comment-15504388 ] Stéphane Cottin commented on MESOS-6002: Same issue using overlayfs : {code} Failed to remove whiteout file '/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq': No such file or directory {code} Can be reproduced with official postgres, rabbitmq and many more docker images, all deleting the same folder in multiple RUN calls. > The whiteout file cannot be removed correctly using aufs backend. > - > > Key: MESOS-6002 > URL: https://issues.apache.org/jira/browse/MESOS-6002 > Project: Mesos > Issue Type: Bug > Components: containerization > Environment: Ubuntu 14, Ubuntu 12 > Or any os with aufs module >Reporter: Gilbert Song > Labels: aufs, backend, containerizer > > The whiteout file is not removed correctly when using the aufs backend in > unified containerizer. It can be verified by this unit test with the aufs > manually specified. 
> {noformat} > [20:11:24] : [Step 10/10] [ RUN ] > ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout > [20:11:24]W: [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] > Creating default 'local' authorizer > [20:11:25]W: [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] > Opened db in 14.308627ms > [20:11:25]W: [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] > Compacted db in 2.558329ms > [20:11:25]W: [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] > Created db iterator in 3086ns > [20:11:25]W: [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] > Seeked to beginning of db in 595ns > [20:11:25]W: [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] > Iterated through 0 keys in the db in 314ns > [20:11:25]W: [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] > Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned > [20:11:25]W: [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] > Starting replica recovery > [20:11:25]W: [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] > Replica is in EMPTY status > [20:11:25]W: [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] > Replica in EMPTY status received a broadcasted recover request from > __req_res__(5640)@172.30.2.105:36006 > [20:11:25]W: [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] > Received a recover response from a replica in EMPTY status > [20:11:25]W: [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] > Updating replica status to STARTING > [20:11:25]W: [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] > Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) > started on 172.30.2.105:36006 > [20:11:25]W: [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags > at startup: --acls="" --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="true" > 
--authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --authenticate_http_readwrite="true" > --authenticators="crammd5" --authorizers="local" > --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" > --registry_strict="true" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs" > [20:11:25]W: [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] > Master only allowing authenticated frameworks to register > [20:11:25]W: [Step 10/10] I0805 20:11:25.004930 24314 master.cpp:441] > Master only allowing authenticated agents to register > [20:11:25]W: [Step 10/10] I0805 20:11:25.004935 24314 master.cpp:454] > Master only allowing authenticated HTTP frameworks to register > [20:11:25]W: [Step 10/10] I0805 20:11:25.004942 24314 credentials.hpp:37] > Loading credentials for authentication from '/tmp/0z753P/credentials' > [20:11:25]W: [Step 10/10] I0805 20:11:25.005018 24314 master.cpp:499] Using > default 'crammd5' aut
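One plausible way to tolerate the failure mode reported here, sketched generically (this is an illustration only, not the patch from https://reviews.apache.org/r/51124/): when an image deletes the same directory in several RUN layers, a whiteout's target may already be gone by the time a later layer is processed, so ENOENT on removal is arguably benign rather than a provisioning error.

```cpp
#include <cerrno>
#include <cstdio>
#include <string>

// Illustrative sketch only (not the actual Mesos patch): treat a missing
// whiteout target as already handled instead of failing provisioning with
// "Failed to remove whiteout file ... No such file or directory".
bool removeWhiteoutTarget(const std::string& path) {
  if (std::remove(path.c_str()) != 0 && errno != ENOENT) {
    return false;  // a genuine failure (EACCES, EBUSY, ...)
  }
  return true;  // removed, or was already gone
}
```

This matches the report above: the postgres and rabbitmq images trip the bug precisely because multiple layers whiteout the same `/var/lib/apt/lists/partial` directory.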
[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems
[ https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504307#comment-15504307 ] Mao Geng commented on MESOS-5909: - Thanks [~kaysoky]! I created https://reviews.apache.org/r/52048/ and added [~gilbert] [~karya] as reviewers. May I add you as a reviewer too? > Stout "OsTest.User" test can fail on some systems > - > > Key: MESOS-5909 > URL: https://issues.apache.org/jira/browse/MESOS-5909 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Kapil Arya >Assignee: Gilbert Song > Labels: mesosphere > Attachments: MESOS-5909-fix.diff > > > Libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted manner > (in my case, it's returning "471 100") ... whereas {{id -G}} return a sorted > list ("100 471" in my case) causing the validation inside the loop to fail. > We should sort both lists before comparing the values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6205) mesos-master cannot find mesos-slave, and elects a new leader in a short interval
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Wu updated MESOS-6205: - Description: I followed this [doc][https://open.mesosphere.com/getting-started/install/#verifying-installation] to set up a Mesos cluster. There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2). {code} $ cat /etc/hosts 10.142.55.190 zk1 10.142.55.196 zk2 10.142.55.202 zk3 {code} Config on each machine: {code} $ cat /etc/mesos/zk zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos {code} After starting zookeeper, mesos-master, and mesos-slave on the three VMs, I can view the Mesos web UI (10.142.55.190:5050), but the agent count is 0. After a little while, the Mesos page gets an error: {code} Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds... {code} (I found that zookeeper would elect a new leader in a short interval) mesos-master cmd: {code} mesos-master --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --ip="10.142.55.190" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --port="5050" --quiet="false" --quorum="2" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" 
--zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" {code} mesos-slave cmd: {code} mesos-slave --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="10.142.55.190" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --ip="10.142.55.190" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos" {code} When I run mesos-master from command-line, I got {code} I0919 17:20:19.286264 17550 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (583)@10.142.55.202:5050 F0919 17:20:20.009371 17556 master.cpp:1536] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins *** Check failure stack trace: *** @ 0x7f9db78458dd google::LogMessage::Fail() @ 0x7f9db784771d google::LogMessage::SendToLog() @ 0x7f9db78454cc google::LogMessage::Flush() @ 0x7f9db7848019 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9db6e2dbbc mesos::internal::master::fail() @
[jira] [Updated] (MESOS-6127) Implement support for HTTP/2
[ https://issues.apache.org/jira/browse/MESOS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Wood updated MESOS-6127: -- Description: HTTP/2 will allow us to take advantage of connection multiplexing, header compression, streams, server push, etc. Add support for communication over HTTP/2 between masters and agents, framework endpoints, etc. Should we support HTTP/2 without TLS? The spec allows for this but most major browser vendors, libraries, and implementations aren't supporting it unless TLS is used. If we do require TLS, what can be done to reduce the performance hit of the TLS handshake? Might need to change more code to make sure that we are taking advantage of connection sharing so that we can (ideally) only ever have a one-time TLS handshake per shared connection. Some ideas for libs: https://nghttp2.org/documentation/package_README.html - Has encoders/decoders supporting HPACK https://nghttp2.org/documentation/tutorial-hpack.html https://nghttp2.org/documentation/libnghttp2_asio.html - Currently marked as experimental by the nghttp2 docs was: HTTP/2 will allow us to take advantage of connection multiplexing, header compression, streams, server push, etc. Add support for communication over HTTP/2 between masters and agents, framework endpoints, etc. Should we support HTTP/2 without TLS? The spec allows for this but most major browser vendors, libraries, and implementations aren't supporting it unless TLS is used. If we do require TLS, what can be done to reduce the performance hit of the TLS handshake? Might need to change more code to make sure that we are taking advantage of connection sharing so that we can (ideally) only ever have a one-time TLS handshake per shared connection. 
> Implement support for HTTP/2 > - > > Key: MESOS-6127 > URL: https://issues.apache.org/jira/browse/MESOS-6127 > Project: Mesos > Issue Type: Epic > Components: HTTP API, libprocess >Reporter: Aaron Wood > Labels: performance > > HTTP/2 will allow us to take advantage of connection multiplexing, header > compression, streams, server push, etc. Add support for communication over > HTTP/2 between masters and agents, framework endpoints, etc. > Should we support HTTP/2 without TLS? The spec allows for this but most major > browser vendors, libraries, and implementations aren't supporting it unless > TLS is used. If we do require TLS, what can be done to reduce the performance > hit of the TLS handshake? Might need to change more code to make sure that we > are taking advantage of connection sharing so that we can (ideally) only ever > have a one-time TLS handshake per shared connection. > Some ideas for libs: > https://nghttp2.org/documentation/package_README.html - Has encoders/decoders > supporting HPACK https://nghttp2.org/documentation/tutorial-hpack.html > https://nghttp2.org/documentation/libnghttp2_asio.html - Currently marked as > experimental by the nghttp2 docs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'
[ https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504216#comment-15504216 ] Anand Mazumdar commented on MESOS-6202: --- We undertook the work to clean up orphaned docker containers correctly as part of MESOS-3573. Unfortunately, this modified behavior should have been part of the 1.0 {{CHANGELOG}} but somehow was missed. As [~haosd...@gmail.com] suggested, you can use the {{docker_kill_orphans}} to get the previous behavior. We currently don't look to see if the {{id}} also has a valid UUID as you had pointed out. It seems orthogonal to this issue though. [~h0tbird] Can you file a separate issue for that? > Docker containerizer kills containers whose name starts with 'mesos-' > - > > Key: MESOS-6202 > URL: https://issues.apache.org/jira/browse/MESOS-6202 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.1 > Environment: Dockerized > {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}} >Reporter: Marc Villacorta > > I run 3 docker containers in my CoreOS system whose names start with > _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_. > I can start the first two without any problem but when I start the third one > _('mesos-agent')_ all three containers are killed by the docker daemon. > If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and > _'m3s0s-agent'_ everything works. > I tracked down the problem to > [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120] > code which is marked to be removed after deprecation cycle. > I was previously running Mesos 0.28.2 without this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'
[ https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504163#comment-15504163 ] Joseph Wu commented on MESOS-6202: -- {quote} I was previously running Mesos 0.28.2 without this problem. {quote} This code has been unchanged since 0.23, so you should be hitting the same problem regardless of what version you are using. > Docker containerizer kills containers whose name starts with 'mesos-' > - > > Key: MESOS-6202 > URL: https://issues.apache.org/jira/browse/MESOS-6202 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.1 > Environment: Dockerized > {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}} >Reporter: Marc Villacorta > > I run 3 docker containers in my CoreOS system whose names start with > _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_. > I can start the first two without any problem but when I start the third one > _('mesos-agent')_ all three containers are killed by the docker daemon. > If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and > _'m3s0s-agent'_ everything works. > I tracked down the problem to > [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120] > code which is marked to be removed after deprecation cycle. > I was previously running Mesos 0.28.2 without this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems
[ https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504128#comment-15504128 ] Joseph Wu commented on MESOS-5909: -- We prefer ReviewBoard for all C++ and non-trivial changes. This is partly for Apache-Foundation-recommended reasons (they want to keep records of everything) and because the Github repo is a read-only mirror :) > Stout "OsTest.User" test can fail on some systems > - > > Key: MESOS-5909 > URL: https://issues.apache.org/jira/browse/MESOS-5909 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Kapil Arya >Assignee: Gilbert Song > Labels: mesosphere > Attachments: MESOS-5909-fix.diff > > > Libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted manner > (in my case, it's returning "471 100") ... whereas {{id -G}} return a sorted > list ("100 471" in my case) causing the validation inside the loop to fail. > We should sort both lists before comparing the values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
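The fix proposed in MESOS-5909, sorting both group lists before comparing them, can be sketched as follows (a self-contained illustration, not the actual stout test code; `sameGroups` is a hypothetical helper name):

```cpp
#include <sys/types.h>

#include <algorithm>
#include <vector>

// getgrouplist(3) and `id -G` may report the same supplementary
// groups in different orders (e.g. "471 100" vs. "100 471"), so
// compare them order-insensitively by sorting both sides first.
bool sameGroups(std::vector<gid_t> fromGetgrouplist,
                std::vector<gid_t> fromIdCommand)
{
  std::sort(fromGetgrouplist.begin(), fromGetgrouplist.end());
  std::sort(fromIdCommand.begin(), fromIdCommand.end());
  return fromGetgrouplist == fromIdCommand;
}
```

Taking the vectors by value lets the helper sort its own copies while leaving the caller's original ordering intact.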
[jira] [Commented] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504119#comment-15504119 ] Joseph Wu commented on MESOS-6205: -- There are two repeating log messages that tell you (indirectly) that something is wrong: {code} I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050 {code} This message means that you've started this master before, with the same work directory. It has some sort of persistent state in its work directory. This log message tells you that there are two masters you have *not* started before: {code} I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise request because 2 ignores received {code} The masters will refuse to start because there is less than a quorum of masters with the persistent state. If the masters were to start, you would have potential data loss. This is the expected behavior, as Mesos errs on the side of caution. I'm assuming you want a fresh cluster (no prior state); you can fix this by deleting the work directory of the master on the {{10.142.55.202}} node. If none of the masters have any prior state, they will reach consensus. > mesos-master can not found mesos-slave, and elect a new leader in a short > interval > -- > > Key: MESOS-6205 > URL: https://issues.apache.org/jira/browse/MESOS-6205 > Project: Mesos > Issue Type: Bug > Components: master > Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64 >Reporter: kasim > > I follow this > [doc][https://open.mesosphere.com/getting-started/install/#verifying-installation] > to setup mesos cluster. > There are three vm(ubuntu 12, centos 6.5, centos 7.2). 
> $ cat /etc/hosts > 10.142.55.190 zk1 > 10.142.55.196 zk2 > 10.142.55.202 zk3 > config in each mathine: > $ cat /etc/mesos/zk > zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos > > After start zookeeper, mesos-master and mesos-slave in three vm, I can view > the mesos webui(10.142.55.190:5050), but agents count is 0. > After a little time, mesos page get error: > Failed to connect to 10.142.55.190:5050! > Retrying in 16 seconds... > (I found that zookeeper would elect a new leader in a short interval) > > mesos-master cmd: > ``` > mesos-master --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="false" > --authenticate_frameworks="false" --authenticate_http_frameworks="false" > --authenticate_http_readonly="false" --authenticate_http_readwrite="false" > --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --ip="10.142.55.190" > --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --port="5050" --quiet="false" --quorum="2" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/share/mesos/webui" > --work_dir="/var/lib/mesos" > --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" > ``` > mesos-slave cmd: > ``` > mesos-slave --appc_simple_discovery_uri_prefix="http://"; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" 
--authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" > --docker="docker" --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/tmp/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="1mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="
[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early
[ https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500170#comment-15500170 ] haosdent edited comment on MESOS-6180 at 9/19/16 4:13 PM: -- Tried to reproduce with stress on an AWS instance (16 cpus, 30 gb mem, Ubuntu 14.04, c4.4xlarge), but could not reproduce it after 5429 iterations either. was (Author: haosd...@gmail.com): Try to reproduce with stress in an aws instance (16 cpus, 32 gb mem, Ubuntu 14.04), but could not reproduce after 5429 iterations as well. > Several tests are flaky, with futures timing out early > -- > > Key: MESOS-6180 > URL: https://issues.apache.org/jira/browse/MESOS-6180 > Project: Mesos > Issue Type: Bug > Components: tests >Reporter: Greg Mann >Assignee: haosdent > Labels: mesosphere, tests > Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, > CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, > flaky-containerizer-pid-namespace-backward.txt, > flaky-containerizer-pid-namespace-forward.txt > > > Following the merging of a large patch chain, it was noticed on our internal > CI that several tests had become flaky, with a similar pattern in the > failures: the tests fail early when a future times out. Often, this occurs > when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests: > * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward > * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward > * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch > * RoleTest.ImplicitRoleRegister > * SlaveRecoveryTest/0.MultipleFrameworks > * SlaveRecoveryTest/0.ReconcileShutdownFramework > * SlaveTest.ContainerizerUsageFailure > * MesosSchedulerDriverTest.ExplicitAcknowledgements > * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164) > * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165) > * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166) > See the linked JIRAs noted above for individual tickets addressing a couple > of these. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-6159) Remove stout's Set type
[ https://issues.apache.org/jira/browse/MESOS-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier reassigned MESOS-6159: --- Assignee: Benjamin Bannier > Remove stout's Set type > --- > > Key: MESOS-6159 > URL: https://issues.apache.org/jira/browse/MESOS-6159 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Benjamin Bannier >Assignee: Benjamin Bannier >Priority: Minor > Labels: tech-debt > > stout provides a {{Set}} type which wraps a {{std::set}}. As only addition it > provides new constructors, > {code} > Set(const T& t1); > Set(const T& t1, const T& t2); > Set(const T& t1, const T& t2, const T& t3); > Set(const T& t1, const T& t2, const T& t3, const T& t4); > {code} > which simplified creation of a {{Set}} from (up to four) known elements. > C++11 brought {{std::initializer_list}} which can be used to create a > {{std::set}} from an arbitrary number of elements, so it appears that it > should be possible to retire {{Set}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
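For comparison, the C++11 form that makes the stout wrapper unnecessary (an illustrative sketch, not an actual stout call site; `makeExample` is a hypothetical name):

```cpp
#include <set>
#include <string>

// Before (stout):  Set<std::string> s("a", "b", "c");
// After (C++11): a braced initializer list constructs a std::set
// from any number of elements, so the arity-limited constructors
// quoted in the ticket are no longer needed.
inline std::set<std::string> makeExample()
{
  return std::set<std::string>{"a", "b", "c"};
}
```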
[jira] [Created] (MESOS-6209) Containers that use the Mesos containerizer but don't want to provision a container image fail to validate.
Jan Schlicht created MESOS-6209: --- Summary: Containers that use the Mesos containerizer but don't want to provision a container image fail to validate. Key: MESOS-6209 URL: https://issues.apache.org/jira/browse/MESOS-6209 Project: Mesos Issue Type: Bug Components: containerization Environment: Mesos HEAD, change was introduced with e65f580bf0cbea64cedf521cf169b9b4c9f85454 Reporter: Jan Schlicht Tasks using features like volumes or CNI in their containers have to define these in {{TaskInfo.container}}. When these tasks don't want/need to provision a container image, neither {{ContainerInfo.docker}} nor {{ContainerInfo.mesos}} will be set. Nevertheless, the container type in {{ContainerInfo.type}} needs to be set, because it is a required field. In that case, the recently introduced validation rules in {{master/validation.cpp}} ({{validateContainerInfo}}) will fail, which isn't expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
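The failure mode can be modeled with a minimal stand-in (a hypothetical simplification: this plain struct is not the actual Mesos protobuf, and `isValid` only sketches the rule the validation presumably should enforce):

```cpp
// Minimal stand-in for ContainerInfo: `type` is always set because
// it is a required field, while the `docker`/`mesos` sub-messages
// are optional.
struct ContainerInfo
{
  enum Type { DOCKER, MESOS };

  Type type;
  bool has_docker = false;
  bool has_mesos = false;
};

// Sketch of the intended rule: a MESOS-type container without a
// `mesos` sub-message is still valid; it simply runs without
// provisioning an image (e.g. it only declares volumes or CNI
// networks). A DOCKER-type container, by contrast, needs DockerInfo.
inline bool isValid(const ContainerInfo& info)
{
  if (info.type == ContainerInfo::DOCKER) {
    return info.has_docker;
  }
  return true;
}
```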
[jira] [Created] (MESOS-6208) Containers that use the Mesos containerizer but don't want to provision a container image fail to validate.
Jan Schlicht created MESOS-6208: --- Summary: Containers that use the Mesos containerizer but don't want to provision a container image fail to validate. Key: MESOS-6208 URL: https://issues.apache.org/jira/browse/MESOS-6208 Project: Mesos Issue Type: Bug Components: containerization Environment: Mesos HEAD, change was introduced with e65f580bf0cbea64cedf521cf169b9b4c9f85454 Reporter: Jan Schlicht Tasks using features like volumes or CNI in their containers have to define these in {{TaskInfo.container}}. When these tasks don't want/need to provision a container image, neither {{ContainerInfo.docker}} nor {{ContainerInfo.mesos}} will be set. Nevertheless, the container type in {{ContainerInfo.type}} needs to be set, because it is a required field. In that case, the recently introduced validation rules in {{master/validation.cpp}} ({{validateContainerInfo}}) will fail, which isn't expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6207) Python bindings fail to build with custom SVN installation path
[ https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503630#comment-15503630 ] Ilya Pronin edited comment on MESOS-6207 at 9/19/16 2:23 PM: - The proposed solution is to move the Python-related configuration steps down after all the others. If that's OK, I'd like to post a patch once my {{contributors.yaml}} PR is merged. was (Author: ipronin): Proposed solution is to move Python related configuration steps down after all others. If that's OK I'd like to post a patch when my {{contributors.yaml}} PR will get merged. > Python bindings fail to build with custom SVN installation path > --- > > Key: MESOS-6207 > URL: https://issues.apache.org/jira/browse/MESOS-6207 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.0.1 >Reporter: Ilya Pronin >Priority: Trivial > > In {{src/Makefile.am}} {{PYTHON_LDFLAGS}} variable is used while building > Python bindings. This variable picks {{LDFLAGS}} during configuration phase > before we check for custom SVN installation path and misses > {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with > uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path
[ https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503630#comment-15503630 ] Ilya Pronin commented on MESOS-6207: Proposed solution is to move Python related configuration steps down after all others. If that's OK I'd like to post a patch when my {{contributors.yaml}} PR will get merged. > Python bindings fail to build with custom SVN installation path > --- > > Key: MESOS-6207 > URL: https://issues.apache.org/jira/browse/MESOS-6207 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.0.1 >Reporter: Ilya Pronin >Priority: Trivial > > In {{src/Makefile.am}} {{PYTHON_LDFLAGS}} variable is used while building > Python bindings. This variable picks {{LDFLAGS}} during configuration phase > before we check for custom SVN installation path and misses > {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with > uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5275) Add capabilities support for unified containerizer.
[ https://issues.apache.org/jira/browse/MESOS-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386728#comment-15386728 ] Benjamin Bannier edited comment on MESOS-5275 at 9/19/16 2:21 PM: -- Reviews: https://reviews.apache.org/r/50271 https://reviews.apache.org/r/51930 https://reviews.apache.org/r/51931 was (Author: bbannier): Review: https://reviews.apache.org/r/50271/ > Add capabilities support for unified containerizer. > --- > > Key: MESOS-5275 > URL: https://issues.apache.org/jira/browse/MESOS-5275 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Jojy Varghese >Assignee: Benjamin Bannier > Labels: mesosphere > > Add capabilities support for unified containerizer. > Requirements: > 1. Use the mesos capabilities API. > 2. Frameworks be able to add capability requests for containers. > 3. Agents be able to add maximum allowed capabilities for all containers > launched. > Design document: > https://docs.google.com/document/d/1YiTift8TQla2vq3upQr7K-riQ_pQ-FKOCOsysQJROGc/edit#heading=h.rgfwelqrskmd -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6207) Python bindings fail to build with custom SVN installation path
Ilya Pronin created MESOS-6207: -- Summary: Python bindings fail to build with custom SVN installation path Key: MESOS-6207 URL: https://issues.apache.org/jira/browse/MESOS-6207 Project: Mesos Issue Type: Bug Components: build Affects Versions: 1.0.1 Reporter: Ilya Pronin Priority: Trivial In {{src/Makefile.am}} {{PYTHON_LDFLAGS}} variable is used while building Python bindings. This variable picks {{LDFLAGS}} during configuration phase before we check for custom SVN installation path and misses {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems with uncommon SVN installation path. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5320) SSL related error messages can be misguiding or incomplete
[ https://issues.apache.org/jira/browse/MESOS-5320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff reassigned MESOS-5320: - Assignee: Till Toenshoff > SSL related error messages can be misguiding or incomplete > -- > > Key: MESOS-5320 > URL: https://issues.apache.org/jira/browse/MESOS-5320 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Till Toenshoff >Assignee: Till Toenshoff > Labels: ssl > > I was trying to activate SSL within Mesos but had rendered an invalid > certificate, it was signed with a mismatching key. Once I started the master, > the error message I received was rather confusing to me: > {noformat} > W0503 10:15:58.027343 6696 openssl.cpp:363] Failed SSL connections will be > downgraded to a non-SSL socket > Could not load key file > {noformat} > To me, this error message hinted that the key file was not existing or had > rights issues. However, a quick {{strace}} revealed that the key-file was > properly accessed, no sign of a file-not-found or alike. > The problem here is the hardcoded error-message, not taking OpenSSL's human > readable error strings into account. > The code that misguided me is located at > https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/openssl.cpp#L471 > We might want to change > {noformat} > // Set private key. > if (SSL_CTX_use_PrivateKey_file( > ctx, > ssl_flags->key_file.get().c_str(), > SSL_FILETYPE_PEM) != 1) { > EXIT(EXIT_FAILURE) << "Could not load key file"; > } > {noformat} > Towards something like this > {noformat} > // Set private key. 
> if (SSL_CTX_use_PrivateKey_file( > ctx, > ssl_flags->key_file.get().c_str(), > SSL_FILETYPE_PEM) != 1) { > EXIT(EXIT_FAILURE) << "Could not use key file: " << > ERR_error_string(ERR_get_error(), NULL); > } > {noformat} > To receive a much more helpful message like this > {noformat} > W0503 13:18:12.551364 11572 openssl.cpp:363] Failed SSL connections will be > downgraded to a non-SSL socket > Could not use key file: error:0B080074:x509 certificate > routines:X509_check_private_key:key values mismatch > {noformat} > A quick scan of the implementation within {{openssl.cpp}} to me suggests that > there are more places that we might want to update with more deterministic > error messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5952) Update docs for new PARTITION_AWARE behavior
[ https://issues.apache.org/jira/browse/MESOS-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-5952: --- Summary: Update docs for new PARTITION_AWARE behavior (was: Update docs for new slave removal behavior) > Update docs for new PARTITION_AWARE behavior > > > Key: MESOS-5952 > URL: https://issues.apache.org/jira/browse/MESOS-5952 > Project: Mesos > Issue Type: Improvement > Components: documentation >Reporter: Neil Conway >Assignee: Neil Conway > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6206) Change reconciliation to return results for in-progress removals and reregistrations
Neil Conway created MESOS-6206: -- Summary: Change reconciliation to return results for in-progress removals and reregistrations Key: MESOS-6206 URL: https://issues.apache.org/jira/browse/MESOS-6206 Project: Mesos Issue Type: Bug Components: master Reporter: Neil Conway Assignee: Neil Conway The master does not return any reconciliation results for agents it views as "transitioning". An agent is defined as transitioning if any of the following are true: 1. The master recovered from the registry after failover but the agent has not yet reregistered 2. The master is in the process of removing an admitted agent from the registry 3. The master is in the process of re-registering an agent (i.e., re-adding it to the list of admitted agents). I think case #1 makes sense but cases #2 and #3 do not. Before the registry operation completes, we should instead view the slave as still being in its previous state ("admitted" for case 2 and not-admitted/unreachable/etc. for case 3). Reasons to make this change: 1. Improve consistency with output of endpoints, etc.: until the registry operation to remove/re-admit a slave finishes, we show the previous state of the slave in the HTTP endpoints. Returning reconciliation results that are consistent with HTTP endpoint values is sensible. 2. It is simpler. Rather than not sending anything to frameworks and requiring that they ask us again later, it is simpler to just send the current state of the agent. If that state changes (whether due to the registry operation succeeding or a subsequent state change), then the reconciliation results might be stale -- so be it. Such stale information fundamentally cannot be avoided. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'
[ https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502997#comment-15502997 ] haosdent commented on MESOS-6202: - This requires that we first update {{UUID::fromString}} to return a {{Try<UUID>}}:
{code}
- static UUID fromString(const std::string& s)
+ static Try<UUID> fromString(const std::string& s)
  {
-   // NOTE: We don't use THREAD_LOCAL for the `string_generator`
-   // (unlike for the `random_generator` above), because it is cheap
-   // to construct one each time.
-   boost::uuids::string_generator gen;
-   boost::uuids::uuid uuid = gen(s);
-   return UUID(uuid);
+   try {
+     // NOTE: We don't use THREAD_LOCAL for the `string_generator`
+     // (unlike for the `random_generator` above), because it is cheap
+     // to construct one each time.
+     boost::uuids::string_generator gen;
+     boost::uuids::uuid uuid = gen(s);
+     return UUID(uuid);
+   } catch (const std::exception& e) {
+     return Error("Invalid UUID '" + s + "': " + e.what());
+   } catch (...) {
+     return Error("Invalid UUID '" + s + "': unknown exception.");
+   }
  }
{code}
> Docker containerizer kills containers whose name starts with 'mesos-' > - > > Key: MESOS-6202 > URL: https://issues.apache.org/jira/browse/MESOS-6202 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.1 > Environment: Dockerized > {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}} >Reporter: Marc Villacorta > > I run 3 docker containers in my CoreOS system whose names start with > _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_. > I can start the first two without any problem but when I start the third one > _('mesos-agent')_ all three containers are killed by the docker daemon. > If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and > _'m3s0s-agent'_ everything works. 
> I tracked down the problem to > [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120] > code which is marked to be removed after deprecation cycle. > I was previously running Mesos 0.28.2 without this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502887#comment-15502887 ] haosdent commented on MESOS-6205: - Hi [~mithril], you need to make sure that every master can talk to the others. According to your logs, your masters have still not finished leader election. > mesos-master can not found mesos-slave, and elect a new leader in a short > interval > -- > > Key: MESOS-6205 > URL: https://issues.apache.org/jira/browse/MESOS-6205 > Project: Mesos > Issue Type: Bug > Components: master > Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64 >Reporter: kasim > > I follow this > [doc][https://open.mesosphere.com/getting-started/install/#verifying-installation] > to setup mesos cluster. > There are three vm(ubuntu 12, centos 6.5, centos 7.2). > $ cat /etc/hosts > 10.142.55.190 zk1 > 10.142.55.196 zk2 > 10.142.55.202 zk3 > config in each mathine: > $ cat /etc/mesos/zk > zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos > > After start zookeeper, mesos-master and mesos-slave in three vm, I can view > the mesos webui(10.142.55.190:5050), but agents count is 0. > After a little time, mesos page get error: > Failed to connect to 10.142.55.190:5050! > Retrying in 16 seconds... 
> (I found that zookeeper would elect a new leader in a short interval) > > mesos-master cmd: > ``` > mesos-master --agent_ping_timeout="15secs" > --agent_reregister_timeout="10mins" --allocation_interval="1secs" > --allocator="HierarchicalDRF" --authenticate_agents="false" > --authenticate_frameworks="false" --authenticate_http_frameworks="false" > --authenticate_http_readonly="false" --authenticate_http_readwrite="false" > --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" > --help="false" --hostname_lookup="true" --http_authenticators="basic" > --initialize_driver_logging="true" --ip="10.142.55.190" > --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --port="5050" --quiet="false" --quorum="2" > --recovery_agent_removal_limit="100%" --registry="replicated_log" > --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/share/mesos/webui" > --work_dir="/var/lib/mesos" > --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" > ``` > mesos-slave cmd: > ``` > mesos-slave --appc_simple_discovery_uri_prefix="http://"; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" > --docker="docker" --docker_kill_orphans="true" > --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" > 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" > --docker_store_dir="/tmp/mesos/store/docker" > --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="1mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --hostname="10.142.55.190" > --hostname_lookup="true" --http_authenticators="basic" > --http_command_executor="false" --image_provisioner_backend="copy" > --initialize_driver_logging="true" --ip="10.142.55.190" > --isolation="posix/cpu,posix/mem" --launcher="posix" > --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" > --logbufsecs="0" --logging_level="INFO" > --master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" > --oversubscribed_resources_interval="15secs" --perf_duration="10secs" > --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" > --quiet="false" --recover="reconnect" --recovery_timeout="15mins" > --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" > --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" > --systemd_enable_support="true" > --systemd_runtim
[jira] [Updated] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kasim updated MESOS-6205: - Description: I follow this [doc][https://open.mesosphere.com/getting-started/install/#verifying-installation] to setup mesos cluster. There are three vm(ubuntu 12, centos 6.5, centos 7.2). $ cat /etc/hosts 10.142.55.190 zk1 10.142.55.196 zk2 10.142.55.202 zk3 config in each mathine: $ cat /etc/mesos/zk zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos After start zookeeper, mesos-master and mesos-slave in three vm, I can view the mesos webui(10.142.55.190:5050), but agents count is 0. After a little time, mesos page get error: Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds... (I found that zookeeper would elect a new leader in a short interval) mesos-master cmd: ``` mesos-master --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --ip="10.142.55.190" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --port="5050" --quiet="false" --quorum="2" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" ``` mesos-slave 
cmd: ``` mesos-slave --appc_simple_discovery_uri_prefix="http://"; --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="10.142.55.190" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --ip="10.142.55.190" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos" ``` When I run mesos-master from command-line, I got ``` I0919 17:20:19.286264 17550 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (583)@10.142.55.202:5050 F0919 17:20:20.009371 17556 master.cpp:1536] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins *** Check failure stack trace: *** @ 0x7f9db78458dd google::LogMessage::Fail() @ 0x7f9db784771d google::LogMessage::SendToLog() @ 0x7f9db78454cc google::LogMessage::Flush() @ 0x7f9db7848019 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9db6e2dbbc mesos::internal::master::fail() @ 0x7f9db6e75b20 _ZNSt17_Function_handlerIFvRKSsEZNK7proce
[jira] [Updated] (MESOS-6205) mesos-master cannot find mesos-slave, and a new leader is elected at short intervals
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kasim updated MESOS-6205: - Description: I followed this [doc](https://open.mesosphere.com/getting-started/install/#verifying-installation) to set up a Mesos cluster. There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2). $ cat /etc/hosts 10.142.55.190 zk1 10.142.55.196 zk2 10.142.55.202 zk3 Config on each machine: $ cat /etc/mesos/zk zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos After starting ZooKeeper, mesos-master, and mesos-slave on all three VMs, I can view the Mesos web UI (10.142.55.190:5050), but the agent count is 0. After a short while, the Mesos page shows an error: Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds... (I found that ZooKeeper elects a new leader at short intervals.) mesos-master cmd: ``` mesos-master --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate_agents="false" --authenticate_frameworks="false" --authenticate_http_frameworks="false" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --initialize_driver_logging="true" --ip="10.142.55.190" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --port="5050" --quiet="false" --quorum="2" --recovery_agent_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" --registry_strict="false" --root_submissions="true" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" ``` mesos-slave
cmd: ``` mesos-slave --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="10.142.55.190" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --ip="10.142.55.190" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos" ``` When I run mesos-master from the command line, I get ``` I0919 17:20:19.286264 17550 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (583)@10.142.55.202:5050 F0919 17:20:20.009371 17556 master.cpp:1536] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins *** Check failure stack trace: *** @ 0x7f9db78458dd google::LogMessage::Fail() @ 0x7f9db784771d google::LogMessage::SendToLog() @ 0x7f9db78454cc google::LogMessage::Flush() @ 0x7f9db7848019 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9db6e2dbbc mesos::internal::master::fail() @ 0x7f9db6e75b20 _ZNSt17_Function_handlerIFvRKSsEZNK7proce
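A note on the flags above: with three masters, `--quorum="2"` already matches the usual strict-majority rule for the replicated registry log, so the `Failed to recover registrar` crash suggests the replicas cannot reach each other within `--registry_fetch_timeout`, rather than a misconfigured quorum. A quick sanity check of that majority arithmetic (helper name hypothetical, not part of Mesos):

```python
def required_quorum(num_masters: int) -> int:
    # The registry's replicated log needs a strict majority of the
    # masters to agree before master recovery can complete.
    return num_masters // 2 + 1

print(required_quorum(3))  # 2, matching --quorum="2" above
print(required_quorum(5))  # 3
```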
[jira] [Updated] (MESOS-6110) Deprecate using health checks without setting the type
[ https://issues.apache.org/jira/browse/MESOS-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6110: --- Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43 (was: Mesosphere Sprint 42) > Deprecate using health checks without setting the type > -- > > Key: MESOS-6110 > URL: https://issues.apache.org/jira/browse/MESOS-6110 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.1.0 >Reporter: Silas Snider >Assignee: haosdent >Priority: Blocker > Labels: compatibility, health-check, mesosphere > > When sending a task launch using the 1.0.x protos and the legacy (non-HTTP) > API, tasks with a health check defined are rejected (TASK_ERROR) because the > 'type' field is not set. > This field is marked optional in the proto and was not available before 1.1.0, > so it should not be required, in order to keep the Mesos v1 API compatibility > promise. > For backwards compatibility, temporarily allow the case where a command > health check is set without a type. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
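The compatibility rule proposed in the last sentence can be sketched as follows (function and argument names are hypothetical, not the actual Mesos validation code):

```python
def accept_health_check(has_type: bool, has_command: bool) -> bool:
    # New-style (1.1.0+) checks carry an explicit type; legacy 1.0.x
    # checks set only `command` and should be accepted during the
    # deprecation cycle instead of producing TASK_ERROR.
    if has_type:
        return True
    return has_command

# A 1.0.x framework's command check passes; an empty check is still rejected.
print(accept_health_check(has_type=False, has_command=True))   # True
print(accept_health_check(has_type=False, has_command=False))  # False
```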
[jira] [Updated] (MESOS-5987) Update health check protobuf for HTTP and TCP health check
[ https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-5987: --- Sprint: Mesosphere Sprint 40, Mesosphere Sprint 42, Mesosphere Sprint 43 (was: Mesosphere Sprint 40, Mesosphere Sprint 42) > Update health check protobuf for HTTP and TCP health check > -- > > Key: MESOS-5987 > URL: https://issues.apache.org/jira/browse/MESOS-5987 > Project: Mesos > Issue Type: Task >Reporter: haosdent >Assignee: haosdent > Labels: health-check, mesosphere > Fix For: 1.1.0 > > > To support HTTP and TCP health checks, we need to update the existing > {{HealthCheck}} protobuf message per the comments by [~alexr] and [~gaston] > in https://reviews.apache.org/r/36816/ and > https://reviews.apache.org/r/49360/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6119) TCP health checks are not portable.
[ https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6119: --- Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43 (was: Mesosphere Sprint 42) > TCP health checks are not portable. > --- > > Key: MESOS-6119 > URL: https://issues.apache.org/jira/browse/MESOS-6119 > Project: Mesos > Issue Type: Bug >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov > Labels: health-check, mesosphere > > MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is > undesirable. We should implement a portable solution for TCP health checks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
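For illustration, a portable TCP check need not shell out at all; a minimal sketch of a bash-free connect probe (this is an assumption of the approach, not the actual Mesos implementation):

```python
import socket

def tcp_health_check(host: str, port: int, timeout: float = 1.0) -> bool:
    # Healthy iff a TCP connection can be established within the timeout.
    # No dependency on bash, nc, or any other external tool.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```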
[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'
[ https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502699#comment-15502699 ] Marc Villacorta commented on MESOS-6202: Would you consider adding a validation to make sure {{id}} is a valid Docker UUID? > Docker containerizer kills containers whose name starts with 'mesos-' > - > > Key: MESOS-6202 > URL: https://issues.apache.org/jira/browse/MESOS-6202 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 1.0.1 > Environment: Dockerized > {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}} >Reporter: Marc Villacorta > > I run 3 Docker containers in my CoreOS system whose names start with > _'mesos-'_; those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_. > I can start the first two without any problem, but when I start the third one > _('mesos-agent')_, all three containers are killed by the Docker daemon. > If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and > _'m3s0s-agent'_, everything works. > I tracked down the problem to > [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120] > code, which is marked to be removed after a deprecation cycle. > I was previously running Mesos 0.28.2 without this problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
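For context, the linked cleanup code decides whether a container belongs to Mesos by its name prefix, which is why unrelated containers named mesos-* get caught; a simplified Python illustration of that prefix check (constant and helper names are hypothetical stand-ins for the C++ code):

```python
MESOS_DOCKER_NAME_PREFIX = "mesos-"  # prefix the Docker containerizer uses

def treated_as_mesos_container(container_name: str) -> bool:
    # A bare prefix match cannot tell a user's "mesos-dns" apart from an
    # executor container the agent launched itself, so both get cleaned up.
    return container_name.startswith(MESOS_DOCKER_NAME_PREFIX)

for name in ("mesos-master", "mesos-dns", "m3s0s-agent"):
    print(name, treated_as_mesos_container(name))
```

This also explains the reporter's workaround: renaming to m3s0s-* simply escapes the prefix match.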
[jira] [Updated] (MESOS-6205) mesos-master cannot find mesos-slave, and a new leader is elected at short intervals
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kasim updated MESOS-6205: - Description: I followed this [doc](https://open.mesosphere.com/getting-started/install/#verifying-installation) to set up a Mesos cluster. There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2). $ cat /etc/hosts 10.142.55.190 zk1 10.142.55.196 zk2 10.142.55.202 zk3 Config on each machine: $ cat /etc/mesos/zk zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos After starting ZooKeeper, mesos-master, and mesos-slave on all three VMs, I can view the Mesos web UI (10.142.55.190:5050), but the agent count is 0. After a short while, the Mesos page shows an error: Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds... (I found that ZooKeeper elects a new leader at short intervals.) master info log: I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for /master/state?jsonp=angular.callbacks._1x to the leading master zk3 I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (768)@10.142.55.202:5050 I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (185)@10.142.55.196:5050 I0919 15:55:00.79 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (771)@10.142.55.202:5050 I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (188)@10.142.55.196:5050 I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (774)@10.142.55.202:5050 I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (191)@10.142.55.196:5050 I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (777)@10.142.55.202:5050 I0919 15:55:03.205920 13285 
replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (780)@10.142.55.202:5050 I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (194)@10.142.55.196:5050 I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (783)@10.142.55.202:5050 I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (197)@10.142.55.196:5050 I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (200)@10.142.55.196:5050 I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (786)@10.142.55.202:5050 I0919 15:55:05.184682 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (789)@10.142.55.202:5050 I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (203)@10.142.55.196:5050 I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (206)@10.142.55.196:5050 I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (792)@10.142.55.202:5050 I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (209)@10.142.55.196:5050 I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (5)@10.142.55.202:5050 I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships changed I0919 15:55:07.393427 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000709' in ZooKeeper I0919 15:55:07.393985 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000711' in ZooKeeper I0919 15:55:07.394394 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000714' in ZooKeeper I0919 15:55:07.394843 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000715' in ZooKeeper I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, log-replica(1)@10.142.55.202:5050 } I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050 I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (21)@10.142.55.202:5050 I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (24)@10.142.55.202:5050 I0919 15:55:09.742185 13287 http.cpp:381] HTTP GET for /m
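One way to confirm the suspected leader churn described above is to poll each ZooKeeper node's `srvr` four-letter command and watch the reported mode change over time. A rough sketch, assuming the zk1/zk2/zk3 hostnames from /etc/hosts and a ZooKeeper build with four-letter commands enabled (helper names hypothetical):

```python
import socket

def parse_zk_mode(srvr_output: str) -> str:
    # `srvr` output contains a line like "Mode: leader" or "Mode: follower".
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return "unknown"

def zk_mode(host: str, port: int = 2181, timeout: float = 2.0) -> str:
    # Send the four-letter `srvr` command over a raw TCP connection;
    # ZooKeeper replies and then closes the socket.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr\n")
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_zk_mode(b"".join(chunks).decode())

# Example (against a live ensemble):
#   for host in ("zk1", "zk2", "zk3"):
#       print(host, zk_mode(host))
```

If the node reported as leader keeps changing between polls, the ensemble itself is unstable and the master flapping is a symptom rather than the cause.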
[jira] [Updated] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval
[ https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kasim updated MESOS-6205: - Description: I followed this [doc|https://open.mesosphere.com/getting-started/install/#verifying-installation] to set up a Mesos cluster. There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2).
{noformat}
$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3
{noformat}
Config on each machine:
{noformat}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{noformat}
After I started ZooKeeper, mesos-master, and mesos-slave on the three VMs, I could view the Mesos web UI (10.142.55.190:5050). I found the agent count was 0. After a short time, the page showed the error: Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds... (I found that ZooKeeper would elect a new leader at short intervals.)
Master info log:
{noformat}
I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for /master/state?jsonp=angular.callbacks._1x to the leading master zk3
I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (768)@10.142.55.202:5050
I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (185)@10.142.55.196:5050
I0919 15:55:00.79 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (771)@10.142.55.202:5050
I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (188)@10.142.55.196:5050
I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (774)@10.142.55.202:5050
I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (191)@10.142.55.196:5050
I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (777)@10.142.55.202:5050
I0919 15:55:03.205920 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (780)@10.142.55.202:5050
I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (194)@10.142.55.196:5050
I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (783)@10.142.55.202:5050
I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (197)@10.142.55.196:5050
I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (200)@10.142.55.196:5050
I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (786)@10.142.55.202:5050
I0919 15:55:05.184682 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (789)@10.142.55.202:5050
I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (203)@10.142.55.196:5050
I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (206)@10.142.55.196:5050
I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (792)@10.142.55.202:5050
I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (209)@10.142.55.196:5050
I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (5)@10.142.55.202:5050
I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships changed
I0919 15:55:07.393427 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000709' in ZooKeeper
I0919 15:55:07.393985 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000711' in ZooKeeper
I0919 15:55:07.394394 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000714' in ZooKeeper
I0919 15:55:07.394843 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000715' in ZooKeeper
I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, log-replica(1)@10.142.55.202:5050 }
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050
I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (21)@10.142.55.202:5050
I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (24)@10.142.55.202:5050
I0919 15:55:09.742185 13287 http.cpp:381] HTTP GET
{noformat}
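For reference, the zk:// connection string in /etc/mesos/zk above decomposes into an ensemble member list and a chroot path, and every master and agent in the cluster must be given the same string. A minimal standalone sketch (the `parse_zk_url` helper is hypothetical illustration, not Mesos code):

```python
# Sketch (not Mesos code): decompose the zk:// string from /etc/mesos/zk
# into its ZooKeeper ensemble members and the chroot path under which
# the masters coordinate leader election.
def parse_zk_url(url):
    """Split 'zk://h1:p1,h2:p2/path' into ([(host, port), ...], '/path')."""
    assert url.startswith("zk://"), "expected a zk:// connection string"
    rest = url[len("zk://"):]
    hosts_part, _, chroot = rest.partition("/")
    members = []
    for endpoint in hosts_part.split(","):
        host, _, port = endpoint.partition(":")
        members.append((host, int(port)))
    return members, "/" + chroot

members, path = parse_zk_url(
    "zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos")
# A 3-member ensemble needs a majority (2 of 3) reachable to keep a
# stable leader; repeated re-elections usually point to members that
# cannot reach each other on port 2181.
quorum = len(members) // 2 + 1
```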
[jira] [Created] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval
kasim created MESOS-6205:
Summary: mesos-master can not found mesos-slave, and elect a new leader in a short interval
Key: MESOS-6205
URL: https://issues.apache.org/jira/browse/MESOS-6205
Project: Mesos
Issue Type: Bug
Components: master
Environment: Ubuntu 12 x64, CentOS 6.5 x64, CentOS 7.2 x64
Reporter: kasim

I followed this [doc][1] to set up a Mesos cluster. There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2).
{noformat}
$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3
{noformat}
Config on each machine:
{noformat}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{noformat}
After I started mesos-master on the three VMs, I could view the Mesos web UI (10.142.55.190:5050), but after a short time the page showed the error: Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds... (I found that ZooKeeper would elect a new leader at short intervals.)
Master info log:
{noformat}
I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for /master/state?jsonp=angular.callbacks._1x to the leading master zk3
I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (768)@10.142.55.202:5050
I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (185)@10.142.55.196:5050
I0919 15:55:00.79 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (771)@10.142.55.202:5050
I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (188)@10.142.55.196:5050
I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (774)@10.142.55.202:5050
I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (191)@10.142.55.196:5050
I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (777)@10.142.55.202:5050
I0919 15:55:03.205920 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (780)@10.142.55.202:5050
I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (194)@10.142.55.196:5050
I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (783)@10.142.55.202:5050
I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (197)@10.142.55.196:5050
I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (200)@10.142.55.196:5050
I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (786)@10.142.55.202:5050
I0919 15:55:05.184682 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (789)@10.142.55.202:5050
I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (203)@10.142.55.196:5050
I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (206)@10.142.55.196:5050
I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (792)@10.142.55.202:5050
I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (209)@10.142.55.196:5050
I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (5)@10.142.55.202:5050
I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships changed
I0919 15:55:07.393427 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000709' in ZooKeeper
I0919 15:55:07.393985 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000711' in ZooKeeper
I0919 15:55:07.394394 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000714' in ZooKeeper
I0919 15:55:07.394843 13285 group.cpp:706] Trying to get '/mesos/log_replicas/000715' in ZooKeeper
I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, log-replica(1)@10.142.55.202:5050 }
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050
I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (21)@10.142.55.202:5050
I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status
{noformat}