[jira] [Comment Edited] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504388#comment-15504388
 ] 

Stéphane Cottin edited comment on MESOS-6002 at 9/20/16 6:19 AM:
-

Same issue using overlayfs:

{code}
Failed to remove whiteout file 
'/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq':
 No such file or directory
{code}

Can be reproduced with the official postgres, rabbitmq, and many other docker 
images, all of which delete the same folder in multiple RUN calls.

Update: I forgot to mention that this is with https://reviews.apache.org/r/51124/ applied.
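
For what it's worth, a minimal hypothetical sketch (not the r/51124 patch, and assuming stout's {{os::exists}}/{{os::rm}} helpers) of whiteout removal logic that tolerates an already-missing whiteout file rather than failing:

{code}
// Hypothetical sketch, not the actual backend code: treat a whiteout
// path that no longer exists as already handled instead of failing
// the whole provisioning with ENOENT.
#include <string>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/os.hpp>
#include <stout/try.hpp>

Try<Nothing> removeWhiteout(const std::string& path)
{
  if (!os::exists(path)) {
    return Nothing(); // Already gone; nothing to remove.
  }

  Try<Nothing> rm = os::rm(path);
  if (rm.isError()) {
    return Error(
        "Failed to remove whiteout file '" + path + "': " + rm.error());
  }

  return Nothing();
}
{code}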


was (Author: kaalh):
Same issue using overlayfs:

{code}
Failed to remove whiteout file 
'/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq':
 No such file or directory
{code}

Can be reproduced with the official postgres, rabbitmq, and many other docker 
images, all of which delete the same folder in multiple RUN calls.

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. It can be verified by this unit test with the aufs 
> backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing auth

[jira] [Comment Edited] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval

2016-09-19 Thread kasim (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505567#comment-15505567
 ] 

kasim edited comment on MESOS-6205 at 9/20/16 4:53 AM:
---

Thanks, an empty work_dir works. But I don't understand how this situation happened.

At first, I started only one master and zookeeper for testing.

{code}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181/mesos
{code}

The slave on the same machine was able to connect to the master, but the others couldn't.

So I tried starting three mesos-masters and zookeepers to form a cluster, and 
changed `/etc/mesos/zk` to

{code}
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{code}

and then got the above error.


Does this mean I need to clear work_dir every time I add a new mesos-master?


was (Author: mithril):
Thanks, an empty work_dir works. But I don't understand how this situation happened.

At first, I started only one master and zookeeper for testing.

{code}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181/mesos
{code}

The slave on the same machine was able to connect to the master, but the others couldn't.

So I tried starting three masters to form a cluster, changed `/etc/mesos/zk` to

{code}
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{code}

and then got the above error.


Does this mean I need to clear work_dir every time I add a new mesos-master?

> mesos-master can not found mesos-slave, and elect a new leader in a short 
> interval
> --
>
> Key: MESOS-6205
> URL: https://issues.apache.org/jira/browse/MESOS-6205
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64
>Reporter: kasim
>
> I followed this 
> [doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
>  to set up the mesos cluster.
> There are three VMs (ubuntu 12, centos 6.5, centos 7.2).
> {code}
> $ cat /etc/hosts
> 10.142.55.190 zk1
> 10.142.55.196 zk2
> 10.142.55.202 zk3
> {code}
> config on each machine:
> {code}
> $ cat /etc/mesos/zk
> zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
> {code}
> 
> After starting zookeeper, mesos-master, and mesos-slave on the three VMs, I 
> can view the mesos webui (10.142.55.190:5050), but the agent count is 0.
> After a little while, the mesos page shows an error:
> {code}
> Failed to connect to 10.142.55.190:5050!
> Retrying in 16 seconds... 
> {code}
> (I found that zookeeper would elect a new leader at short intervals)
> 
> mesos-master cmd:
> {code}
> mesos-master --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="false" 
> --authenticate_frameworks="false" --authenticate_http_frameworks="false" 
> --authenticate_http_readonly="false" --authenticate_http_readwrite="false" 
> --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --ip="10.142.55.190" 
> --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --port="5050" --quiet="false" --quorum="2" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/share/mesos/webui" 
> --work_dir="/var/lib/mesos" 
> --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
> {code}
> mesos-slave cmd:
> {code}
> mesos-slave --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --ex

[jira] [Comment Edited] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval

2016-09-19 Thread kasim (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505567#comment-15505567
 ] 

kasim edited comment on MESOS-6205 at 9/20/16 4:51 AM:
---

Thanks, an empty work_dir works. But I don't understand how this situation happened.

At first, I started only one master and zookeeper for testing.

{code}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181/mesos
{code}

The slave on the same machine was able to connect to the master, but the others couldn't.

So I tried starting three masters to form a cluster, changed `/etc/mesos/zk` to

{code}
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{code}

and then got the above error.


Does this mean I need to clear work_dir every time I add a new mesos-master?


was (Author: mithril):
Thanks, an empty work_dir works. But I don't understand how this situation happened.

At first, I started only one master and zookeeper for testing.

{code}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181/mesos
{code}

The slave on the same machine was able to connect to the master, but the others couldn't.

So I tried starting three masters to form a cluster, changed `/etc/mesos/zk` to

{code}
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{code}

and then got the above error.


Does this mean I need to clear work_dir every time I add a new mesos-master?

> mesos-master can not found mesos-slave, and elect a new leader in a short 
> interval
> --
>
> Key: MESOS-6205
> URL: https://issues.apache.org/jira/browse/MESOS-6205
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64
>Reporter: kasim
>
> I followed this 
> [doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
>  to set up the mesos cluster.
> There are three VMs (ubuntu 12, centos 6.5, centos 7.2).
> {code}
> $ cat /etc/hosts
> 10.142.55.190 zk1
> 10.142.55.196 zk2
> 10.142.55.202 zk3
> {code}
> config on each machine:
> {code}
> $ cat /etc/mesos/zk
> zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
> {code}
> 
> After starting zookeeper, mesos-master, and mesos-slave on the three VMs, I 
> can view the mesos webui (10.142.55.190:5050), but the agent count is 0.
> After a little while, the mesos page shows an error:
> {code}
> Failed to connect to 10.142.55.190:5050!
> Retrying in 16 seconds... 
> {code}
> (I found that zookeeper would elect a new leader at short intervals)
> 
> mesos-master cmd:
> {code}
> mesos-master --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="false" 
> --authenticate_frameworks="false" --authenticate_http_frameworks="false" 
> --authenticate_http_readonly="false" --authenticate_http_readwrite="false" 
> --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --ip="10.142.55.190" 
> --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --port="5050" --quiet="false" --quorum="2" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/share/mesos/webui" 
> --work_dir="/var/lib/mesos" 
> --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
> {code}
> mesos-slave cmd:
> {code}
> mesos-slave --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_perio

[jira] [Commented] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval

2016-09-19 Thread kasim (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505567#comment-15505567
 ] 

kasim commented on MESOS-6205:
--

Thanks, an empty work_dir works. But I don't understand how this situation happened.

At first, I started only one master and zookeeper for testing.

{code}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181/mesos
{code}

The slave on the same machine was able to connect to the master, but the others couldn't.

So I tried starting three masters to form a cluster, changed `/etc/mesos/zk` to

{code}
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{code}

and then got the above error.


Does this mean I need to clear work_dir every time I add a new mesos-master?

> mesos-master can not found mesos-slave, and elect a new leader in a short 
> interval
> --
>
> Key: MESOS-6205
> URL: https://issues.apache.org/jira/browse/MESOS-6205
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64
>Reporter: kasim
>
> I followed this 
> [doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
>  to set up the mesos cluster.
> There are three VMs (ubuntu 12, centos 6.5, centos 7.2).
> {code}
> $ cat /etc/hosts
> 10.142.55.190 zk1
> 10.142.55.196 zk2
> 10.142.55.202 zk3
> {code}
> config on each machine:
> {code}
> $ cat /etc/mesos/zk
> zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
> {code}
> 
> After starting zookeeper, mesos-master, and mesos-slave on the three VMs, I 
> can view the mesos webui (10.142.55.190:5050), but the agent count is 0.
> After a little while, the mesos page shows an error:
> {code}
> Failed to connect to 10.142.55.190:5050!
> Retrying in 16 seconds... 
> {code}
> (I found that zookeeper would elect a new leader at short intervals)
> 
> mesos-master cmd:
> {code}
> mesos-master --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="false" 
> --authenticate_frameworks="false" --authenticate_http_frameworks="false" 
> --authenticate_http_readonly="false" --authenticate_http_readwrite="false" 
> --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --ip="10.142.55.190" 
> --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --port="5050" --quiet="false" --quorum="2" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/share/mesos/webui" 
> --work_dir="/var/lib/mesos" 
> --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
> {code}
> mesos-slave cmd:
> {code}
> mesos-slave --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname="10.142.55.190" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_provisioner_backend="copy" 
> --initialize_driver_logging="true" --ip="10.142.55.190" 
> --isolation="posix/cpu,posix/mem" --launcher="posix" 
> --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" 
> --logbufsecs="0" --logging_level="INFO" 
> --master="zk://10.142.55.190:2181,10.142.55.196:2181,10.

[jira] [Updated] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-19 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-6210:
--
Labels: newbie  (was: )

I think the bug is here: 
https://github.com/apache/mesos/blob/master/src/master/http.cpp#L2036

We need to do a "contains" check instead of an "==" check. cc 
[~haosd...@gmail.com]

[~drcrallen] Would you like to send a patch for this?
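
A minimal standalone sketch of the suggested check (the helper name and path literal here are hypothetical, not the actual http.cpp code):

{code}
#include <string>

// Hypothetical sketch: recognize the redirect endpoint with a
// "contains" test instead of an exact "==" comparison, so a request
// path like "/master/redirect/master/frameworks" still matches.
bool isRedirectRequest(const std::string& path)
{
  return path.find("/redirect") != std::string::npos;
}
{code}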



> Master redirect with suffix gets in redirect loop
> -
>
> Key: MESOS-6210
> URL: https://issues.apache.org/jira/browse/MESOS-6210
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Reporter: Charles Allen
>  Labels: newbie
>
> Trying to go to a URI like 
> {{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
> redirect loop.
> The expected behavior is to either not support anything after {{redirect}} in 
> the path (redirect must be handled by a smart client), or to redirect to the 
> suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-09-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15505148#comment-15505148
 ] 

Vinod Kone commented on MESOS-6207:
---

[~tillt] is this something you are interested in shepherding?

> Python bindings fail to build with custom SVN installation path
> ---
>
> Key: MESOS-6207
> URL: https://issues.apache.org/jira/browse/MESOS-6207
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.1
>Reporter: Ilya Pronin
>Priority: Trivial
>
> In {{src/Makefile.am}} the {{PYTHON_LDFLAGS}} variable is used while building 
> the Python bindings. This variable picks up {{LDFLAGS}} during the 
> configuration phase, before we check for a custom SVN installation path, and 
> so misses the {{-L$\{with_svn\}/lib}} flag. That causes a link error on 
> systems with an uncommon SVN installation path.





[jira] [Commented] (MESOS-4668) Agent's /state endpoint does not include full reservation information

2016-09-19 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504971#comment-15504971
 ] 

Yan Xu commented on MESOS-4668:
---

As a follow up: MESOS-6211

> Agent's /state endpoint does not include full reservation information
> -
>
> Key: MESOS-4668
> URL: https://issues.apache.org/jira/browse/MESOS-4668
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Neil Conway
>Assignee: Yan Xu
>Priority: Minor
>  Labels: endpoint, reservations
> Fix For: 1.1.0
>
>






[jira] [Created] (MESOS-6211) Add total resources to the agent operator API.

2016-09-19 Thread Yan Xu (JIRA)
Yan Xu created MESOS-6211:
-

 Summary: Add total resources to the agent operator API.
 Key: MESOS-6211
 URL: https://issues.apache.org/jira/browse/MESOS-6211
 Project: Mesos
  Issue Type: Improvement
Reporter: Yan Xu


Looks like it can be a field in 

{code}
message GetAgent {
  optional AgentInfo agent_info = 1;
+  repeated Resource total_resources = 2;
}
{code}

The point is to include reservations.

The name is consistent with 
[master.proto|https://github.com/apache/mesos/blob/29236068f23b6cfbd19d7a4b5b96be852818a356/include/mesos/v1/master/master.proto#L305].






[jira] [Comment Edited] (MESOS-6062) mesos-agent should autodetect mount-type volume sizes

2016-09-19 Thread Anindya Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489143#comment-15489143
 ] 

Anindya Sinha edited comment on MESOS-6062 at 9/19/16 10:45 PM:


RRs published for review:
https://reviews.apache.org/r/51999
https://reviews.apache.org/r/52001
https://reviews.apache.org/r/52002
https://reviews.apache.org/r/51879
https://reviews.apache.org/r/51880
https://reviews.apache.org/r/52071/


was (Author: anindya.sinha):
RRs published for review:
https://reviews.apache.org/r/51999
https://reviews.apache.org/r/52001
https://reviews.apache.org/r/52002
https://reviews.apache.org/r/51879
https://reviews.apache.org/r/51880

> mesos-agent should autodetect mount-type volume sizes
> -
>
> Key: MESOS-6062
> URL: https://issues.apache.org/jira/browse/MESOS-6062
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Yan Xu
>Assignee: Anindya Sinha
>
> When dealing with a large fleet of machines it could be cumbersome to 
> construct a resources JSON file that varies from host to host. Mesos 
> already auto-detects resources such as cpus, mem, and the "root" disk; it 
> should extend this to MOUNT-type disks, since the value should clearly be 
> the size of the entire volume.
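
For illustration, a minimal sketch of such auto-detection (a hypothetical standalone helper, not an actual patch): size the volume from the filesystem backing its root path:

{code}
#include <sys/statvfs.h>

#include <cstdint>
#include <string>

// Hypothetical sketch: total size in bytes of the filesystem mounted
// at 'root', or -1 on error (callers could then fall back to
// requiring an explicit size in the resources JSON).
int64_t volumeSizeBytes(const std::string& root)
{
  struct statvfs buf;
  if (::statvfs(root.c_str(), &buf) != 0) {
    return -1;
  }

  return static_cast<int64_t>(buf.f_blocks) * buf.f_frsize;
}
{code}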





[jira] [Comment Edited] (MESOS-4668) Agent's /state endpoint does not include full reservation information

2016-09-19 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491011#comment-15491011
 ] 

Yan Xu edited comment on MESOS-4668 at 9/19/16 10:35 PM:
-

https://reviews.apache.org/r/51866
https://reviews.apache.org/r/51867
https://reviews.apache.org/r/51868
https://reviews.apache.org/r/52070

The test (shared with MESOS-6085)
https://reviews.apache.org/r/51870


was (Author: xujyan):
https://reviews.apache.org/r/51866
https://reviews.apache.org/r/51867
https://reviews.apache.org/r/51868

The test (shared with MESOS-6085)
https://reviews.apache.org/r/51870

> Agent's /state endpoint does not include full reservation information
> -
>
> Key: MESOS-4668
> URL: https://issues.apache.org/jira/browse/MESOS-4668
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Neil Conway
>Assignee: Yan Xu
>Priority: Minor
>  Labels: endpoint, reservations
> Fix For: 1.1.0
>
>






[jira] [Commented] (MESOS-5821) Clean up the billions of compiler warnings on MSVC

2016-09-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504857#comment-15504857
 ] 

Joseph Wu commented on MESOS-5821:
--

{code}
commit 4a91c1ed8c86ce2100dba24b321fc578ad3d0473
Author: Daniel Pravat 
Date:   Mon Sep 19 14:21:10 2016 -0700

Windows: Fixed warnings in `duration.hpp`.

Captures the result of `std::ostream::precision` via its return type
`std::streamsize` rather than implicitly casting to `long`.

Review: https://reviews.apache.org/r/52061/
{code}
{code}
commit a5993cec547f299a8edff7662599c68e208ecdac
Author: Daniel Pravat 
Date:   Mon Sep 19 14:51:12 2016 -0700

Windows: Fixed warnings in `windows/os.hpp`.

Captured the result of `Process32First` via its return type `BOOL`
rather than implicitly casting to `bool`.

Review: https://reviews.apache.org/r/52063/
{code}
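
For illustration, the pattern from the first commit as a hypothetical standalone example (not the actual {{duration.hpp}} code): {{std::ostream::precision}} returns {{std::streamsize}}, so storing the old value in that type avoids the narrowing warning:

{code}
#include <ostream>

// Hypothetical sketch of the warning fix: keep the saved precision as
// std::streamsize instead of implicitly casting it to long.
void printRounded(std::ostream& out, double value)
{
  const std::streamsize saved = out.precision(9);
  out << value;
  out.precision(saved); // Restore the caller's precision.
}
{code}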

> Clean up the billions of compiler warnings on MSVC
> --
>
> Key: MESOS-5821
> URL: https://issues.apache.org/jira/browse/MESOS-5821
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Reporter: Alex Clemmer
>Assignee: Daniel Pravat
>  Labels: mesosphere, slave
>
> Clean builds of Mesos on Windows will result in approximately {{5800 
> Warning(s)}} or more.





[jira] [Updated] (MESOS-5986) SSL Socket CHECK can fail after socket receives EOF

2016-09-19 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5986:
-
Fix Version/s: 0.27.4
   0.28.3

> SSL Socket CHECK can fail after socket receives EOF
> ---
>
> Key: MESOS-5986
> URL: https://issues.apache.org/jira/browse/MESOS-5986
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.0.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 0.28.3, 1.0.1, 0.27.4
>
>
> While writing a test for MESOS-3753, I encountered a bug where [this 
> check|https://github.com/apache/mesos/blob/853821cafcca3550b9c7bdaba5262d73869e2ee1/3rdparty/libprocess/src/libevent_ssl_socket.cpp#L708]
>  fails at the very end of the test body, while objects in the stack frame are 
> being destroyed. After adding some debug logging output, I produced the 
> following:
> {code}
> I0804 08:32:33.263211 273793024 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.263209 273256448 process.cpp:2970] Cleaning up 
> __limiter__(3)@127.0.0.1:55688
> I0804 08:32:33.263263 275939328 libevent_ssl_socket.cpp:152] *** in 
> initialize(): 14
> I0804 08:32:33.263206 272719872 process.cpp:2865] Resuming 
> (61)@127.0.0.1:55688 at 2016-08-04 15:32:33.263261952+00:00
> I0804 08:32:33.263327 275939328 libevent_ssl_socket.cpp:584] *** in recv()14
> I0804 08:32:33.263337 272719872 hierarchical.cpp:571] Agent 
> e2a49340-34ec-403f-a5a4-15e29c4a2434-S0 deactivated
> I0804 08:32:33.263322 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263343104+00:00
> I0804 08:32:33.263510 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263536 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 19
> I0804 08:32:33.263592 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 19
> I0804 08:32:33.263622 1985901312 process.cpp:3170] Donating thread to 
> (87)@127.0.0.1:55688 while waiting
> I0804 08:32:33.263639 274329600 process.cpp:2865] Resuming 
> __http__(12)@127.0.0.1:55688 at 2016-08-04 15:32:33.263653888+00:00
> I0804 08:32:33.263659 1985901312 process.cpp:2865] Resuming 
> (87)@127.0.0.1:55688 at 2016-08-04 15:32:33.263671040+00:00
> I0804 08:32:33.263730 1985901312 process.cpp:2970] Cleaning up 
> (87)@127.0.0.1:55688
> I0804 08:32:33.263741 275939328 libevent_ssl_socket.cpp:322] *** in 
> event_callback(bev)
> I0804 08:32:33.263736 274329600 process.cpp:2970] Cleaning up 
> __http__(12)@127.0.0.1:55688
> I0804 08:32:33.263778 275939328 libevent_ssl_socket.cpp:353] *** in 
> event_callback check for EOF/CONNECTED/ERROR: 17
> I0804 08:32:33.263818 275939328 libevent_ssl_socket.cpp:159] *** in 
> shutdown(): 17
> I0804 08:32:33.263839 272183296 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.263857920+00:00
> I0804 08:32:33.263933 273793024 process.cpp:2865] Resuming 
> __gc__@127.0.0.1:55688 at 2016-08-04 15:32:33.263951104+00:00
> I0804 08:32:33.264034 275939328 libevent_ssl_socket.cpp:681] *** in send()17
> I0804 08:32:33.264020 272719872 process.cpp:2865] Resuming 
> __http__(11)@127.0.0.1:55688 at 2016-08-04 15:32:33.264041984+00:00
> I0804 08:32:33.264036 274329600 process.cpp:2865] Resuming 
> status-update-manager(3)@127.0.0.1:55688 at 2016-08-04 
> 15:32:33.264056064+00:00
> I0804 08:32:33.264071 272719872 process.cpp:2970] Cleaning up 
> __http__(11)@127.0.0.1:55688
> I0804 08:32:33.264088 274329600 process.cpp:2970] Cleaning up 
> status-update-manager(3)@127.0.0.1:55688
> I0804 08:32:33.264086 275939328 libevent_ssl_socket.cpp:721] *** sending on 
> socket: 17, data: 0
> I0804 08:32:33.264112 272183296 process.cpp:2865] Resuming 
> (89)@127.0.0.1:55688 at 2016-08-04 15:32:33.264126976+00:00
> I0804 08:32:33.264118 275402752 process.cpp:2865] Resuming 
> help@127.0.0.1:55688 at 2016-08-04 15:32:33.264144896+00:00
> I0804 08:32:33.264149 272183296 process.cpp:2970] Cleaning up 
> (89)@127.0.0.1:55688
> I0804 08:32:33.264202 275939328 libevent_ssl_socket.cpp:281] *** in 
> send_callback(bev)
> I0804 08:32:33.264400 273793024 process.cpp:3170] Donating thread to 
> (86)@127.0.0.1:55688 while waiting
> I0804 08:32:33.264413 273256448 process.cpp:2865] Resuming 
> (76)@127.0.0.1:55688 at 2016-08-04 15:32:33.264428032+00:00
> I0804 08:32:33.296268 275939328 libevent_ssl_socket.cpp:300] *** in 
> send_callback(): 17
> I0804 08:32:33.296419 273256448 process.cpp:2970] Cleaning up 
> (76)@127.0.0.1:55688
> I0804 08:32:33.296357 273793024 process.cpp:2865] Resuming 
> (86)@127.0.0.1:55688 at 2016-08-04 15:32:33.296414976+00:00
> I0804 08:32:33.296464 273793024 process.cpp:2970] Cleaning up 
> (86)@127.0.0.1:55688
> I0804 08:32:33.

[jira] [Updated] (MESOS-6152) Resource leak in libevent_ssl_socket.cpp.

2016-09-19 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6152:
-
Fix Version/s: 1.0.2
   0.27.4
   0.28.3

> Resource leak in libevent_ssl_socket.cpp.
> -
>
> Key: MESOS-6152
> URL: https://issues.apache.org/jira/browse/MESOS-6152
> Project: Mesos
>  Issue Type: Bug
>Reporter: Joerg Schad
>Assignee: Benjamin Bannier
>  Labels: coverity
> Fix For: 0.28.3, 0.27.4, 1.1.0, 1.0.2
>
>
> Coverity detected the following resource leak.
> IMO {{if (fd == -1)}} should be {{if (owned_fd == -1)}}.
> {code}
>  // Duplicate the file descriptor because Libevent will take ownership
> 754  // and control the lifecycle separately.
> 755  //
> 756  // TODO(josephw): We can avoid duplicating the file descriptor in
> 757  // future versions of Libevent. In Libevent versions 2.1.2 and later,
> 758  // we may use `evbuffer_file_segment_new` and `evbuffer_add_file_segment`
> 759  // instead of `evbuffer_add_file`.
>   3. open_fn: Returning handle opened by dup.
>   4. var_assign: Assigning: owned_fd = handle returned from dup(fd).
> 760  int owned_fd = dup(fd);
>   CID 1372873: Argument cannot be negative (REVERSE_NEGATIVE) [select 
> issue]
>   5. Condition fd == -1, taking true branch.
> 761  if (fd == -1) {
>   
> CID 1372872 (#1 of 1): Resource leak (RESOURCE_LEAK)
> 6. leaked_handle: Handle variable owned_fd going out of scope leaks the 
> handle.
> 762return Failure(ErrnoError("Failed to duplicate file descriptor"));
> 763  }
> {code}
> https://scan5.coverity.com/reports.htm#v39597/p10429/fileInstanceId=98881747&defectInstanceId=28450468
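
Putting the suggested one-liner together, the corrected fragment would read (a sketch; {{Failure}} and {{ErrnoError}} are the libprocess/stout types already used in this file):

{code}
// Check the duplicated descriptor, not the original 'fd', so a failed
// dup() is actually detected and no longer leaks.
int owned_fd = dup(fd);
if (owned_fd == -1) {
  return Failure(ErrnoError("Failed to duplicate file descriptor"));
}
{code}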





[jira] [Updated] (MESOS-6104) Potential FD double close in libevent's implementation of `sendfile`.

2016-09-19 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6104:
-
Fix Version/s: 1.0.2
   0.27.4
   0.28.3

> Potential FD double close in libevent's implementation of `sendfile`.
> -
>
> Key: MESOS-6104
> URL: https://issues.apache.org/jira/browse/MESOS-6104
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.27.3, 0.28.2, 1.0.1
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: mesosphere, ssl
> Fix For: 0.28.3, 0.27.4, 1.1.0, 1.0.2
>
>
> Repro copied from: https://reviews.apache.org/r/51509/
> It is possible to make the master CHECK fail by repeatedly hitting the web UI 
> and reloading the static assets:
> 1) Paste lots of text (16KB or more) into 
> `src/webui/master/static/home.html`.  The more text, the more reliable the 
> repro.
> 2) Start the master with SSL enabled:
> {code}
> LIBPROCESS_SSL_ENABLED=true LIBPROCESS_SSL_KEY_FILE=key.pem 
> LIBPROCESS_SSL_CERT_FILE=cert.pem bin/mesos-master.sh --work_dir=/tmp/master
> {code}
> 3) Run two instances of this python script repeatedly:
> {code}
> import socket
> import ssl
> s = ssl.wrap_socket(socket.socket())
> s.connect(("localhost", 5050))
> s.sendall("""GET /static/home.html HTTP/1.1
> User-Agent: foobar
> Host: localhost:5050
> Accept: */*
> Connection: Keep-Alive
>
> """)
> # The HTTP part of the response
> print s.recv(1000)
> {code}
> i.e. 
> {code}
> while python test.py; do :; done & while python test.py; do :; done
> {code}





[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems

2016-09-19 Thread Mao Geng (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504625#comment-15504625
 ] 

Mao Geng commented on MESOS-5909:
-

Got it. Thanks! I look forward to addressing the review comments.

> Stout "OsTest.User" test can fail on some systems
> -
>
> Key: MESOS-5909
> URL: https://issues.apache.org/jira/browse/MESOS-5909
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Gilbert Song
>  Labels: mesosphere
> Attachments: MESOS-5909-fix.diff
>
>
> The libc call {{getgrouplist}} doesn't return the {{gid}} list in sorted 
> order (in my case, it returns "471 100"), whereas {{id -G}} returns a sorted 
> list ("100 471" in my case), causing the validation inside the loop to fail.
> We should sort both lists before comparing the values.
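
For illustration, a minimal sketch of the proposed fix (a hypothetical standalone helper, not the test's actual code):

{code}
#include <sys/types.h>

#include <algorithm>
#include <vector>

// Hypothetical sketch: sort both gid lists before comparing, since
// getgrouplist(3) and `id -G` may order the groups differently.
bool sameGroups(std::vector<gid_t> expected, std::vector<gid_t> actual)
{
  std::sort(expected.begin(), expected.end());
  std::sort(actual.begin(), actual.end());
  return expected == actual;
}
{code}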





[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-19 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15497951#comment-15497951
 ] 

Greg Mann edited comment on MESOS-6180 at 9/19/16 8:38 PM:
---

Thanks for the patch to address the mount leak [~jieyu]! 
(https://reviews.apache.org/r/51963/)

I ran {{sudo MESOS_VERBOSE=1 GLOG_v=2 GTEST_REPEAT=-1 GTEST_BREAK_ON_FAILURE=1 
GTEST_FILTER="\*MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespace\*"
 bin/mesos-tests.sh}} and stressed my machine with {{stress -c N -i N -m N -d 
1}}, where {{N}} is the number of cores, and I was able to reproduce a couple of 
these offer-future timeout failures after a few tens of repetitions. I attached 
logs above as {{flaky-containerizer-pid-namespace-forward.txt}} and 
{{flaky-containerizer-pid-namespace-backward.txt}}.

We can see the master beginning agent registration, but we never see the line 
{{Registered agent ...}} from {{Master::_registerSlave()}}, which indicates 
that registration is complete and the registered message has been sent to the 
agent:
{code}
I0917 01:35:17.184216   480 master.cpp:4886] Registering agent at 
slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) with 
id fa7a42d0-5d0c-4799-b19f-2a85b43039f3-S0
I0917 01:35:17.184232   474 process.cpp:2707] Resuming 
__reaper__(1)@172.31.1.104:57341 at 2016-09-17 01:35:17.184222976+00:00
I0917 01:35:17.184377   474 process.cpp:2707] Resuming 
registrar(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184371968+00:00
I0917 01:35:17.184554   474 registrar.cpp:464] Applied 1 operations in 79217ns; 
attempting to update the registry
I0917 01:35:17.184953   474 process.cpp:2697] Spawned process 
__latch__(141)@172.31.1.104:57341
I0917 01:35:17.184990   485 process.cpp:2707] Resuming 
log-storage(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.184982016+00:00
I0917 01:35:17.185561   485 process.cpp:2707] Resuming 
log-writer(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185552896+00:00
I0917 01:35:17.185609   485 log.cpp:577] Attempting to append 434 bytes to the 
log
I0917 01:35:17.185804   485 process.cpp:2707] Resuming 
log-coordinator(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.185797888+00:00
I0917 01:35:17.185863   485 coordinator.cpp:348] Coordinator attempting to 
write APPEND action at position 3
I0917 01:35:17.185998   485 process.cpp:2697] Spawned process 
log-write(29)@172.31.1.104:57341
I0917 01:35:17.186030   475 process.cpp:2707] Resuming 
log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186021888+00:00
I0917 01:35:17.186189   475 process.cpp:2707] Resuming 
log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186182912+00:00
I0917 01:35:17.186275   475 process.cpp:2707] Resuming 
log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186267904+00:00
I0917 01:35:17.186424   475 process.cpp:2707] Resuming 
log-network(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186416896+00:00
I0917 01:35:17.186575   475 process.cpp:2697] Spawned process 
__req_res__(55)@172.31.1.104:57341
I0917 01:35:17.186724   475 process.cpp:2707] Resuming 
log-write(29)@172.31.1.104:57341 at 2016-09-17 01:35:17.186717952+00:00
I0917 01:35:17.186609   485 process.cpp:2707] Resuming 
__req_res__(55)@172.31.1.104:57341 at 2016-09-17 01:35:17.186601984+00:00
I0917 01:35:17.186898   485 process.cpp:2707] Resuming 
log-replica(6)@172.31.1.104:57341 at 2016-09-17 01:35:17.186892032+00:00
I0917 01:35:17.186962   485 replica.cpp:537] Replica received write request for 
position 3 from __req_res__(55)@172.31.1.104:57341
I0917 01:35:17.185014   471 process.cpp:2707] Resuming 
__gc__@172.31.1.104:57341 at 2016-09-17 01:35:17.185008896+00:00
I0917 01:35:17.185036   480 process.cpp:2707] Resuming 
__latch__(141)@172.31.1.104:57341 at 2016-09-17 01:35:17.185029120+00:00
I0917 01:35:17.196358   482 process.cpp:2707] Resuming 
slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.196335104+00:00
I0917 01:35:17.196900   482 slave.cpp:1471] Will retry registration in 
25.224033ms if necessary
I0917 01:35:17.197029   482 process.cpp:2707] Resuming 
master@172.31.1.104:57341 at 2016-09-17 01:35:17.197024000+00:00
I0917 01:35:17.197157   482 master.cpp:4874] Ignoring register agent message 
from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) 
as admission is already in progress
I0917 01:35:17.224309   482 process.cpp:2707] Resuming 
slave(11)@172.31.1.104:57341 at 2016-09-17 01:35:17.224284928+00:00
I0917 01:35:17.224845   482 slave.cpp:1471] Will retry registration in 
63.510932ms if necessary
I0917 01:35:17.224900   475 process.cpp:2707] Resuming 
master@172.31.1.104:57341 at 2016-09-17 01:35:17.224888064+00:00
I0917 01:35:17.225109   475 master.cpp:4874] Ignoring register agent message 
from slave(11)@172.31.1.104:57341 (ip-172-31-1-104.us-west-2.compute.internal) 
as admission is already in progress
{code}


was (Author: greggomann):
Thanks for t

[jira] [Created] (MESOS-6210) Master redirect with suffix gets in redirect loop

2016-09-19 Thread Charles Allen (JIRA)
Charles Allen created MESOS-6210:


 Summary: Master redirect with suffix gets in redirect loop
 Key: MESOS-6210
 URL: https://issues.apache.org/jira/browse/MESOS-6210
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API
Reporter: Charles Allen


Trying to go to a URI like 
{{http://SOME_MASTER:5050/master/redirect/master/frameworks}} ends up in a 
redirect loop.

The expected behavior is to either not support anything after {{redirect}} in 
the path (redirect must be handled by a smart client), or to redirect to the 
suffix (redirect can be handled by a dumb client).





[jira] [Commented] (MESOS-3011) Publish release documentation for major releases on website

2016-09-19 Thread Tim Anderegg (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504572#comment-15504572
 ] 

Tim Anderegg commented on MESOS-3011:
-

Whoops, fixed :)  Thanks!

> Publish release documentation for major releases on website
> ---
>
> Key: MESOS-3011
> URL: https://issues.apache.org/jira/browse/MESOS-3011
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation, project website
>Reporter: Paul Brett
>Assignee: Tim Anderegg
>  Labels: documentation, mesosphere
>
> Currently, the website only provides a single version of the documentation.  
> We should publish documentation for each release on the website independently 
> (for example as https://mesos.apache.org/documentation/0.22/index.html, 
> https://mesos.apache.org/documentation/0.23/index.html) and make latest 
> redirect to the current version.





[jira] [Commented] (MESOS-3011) Publish release documentation for major releases on website

2016-09-19 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504552#comment-15504552
 ] 

Neil Conway commented on MESOS-3011:


[~tanderegg] -- seems like the review request is private?

> Publish release documentation for major releases on website
> ---
>
> Key: MESOS-3011
> URL: https://issues.apache.org/jira/browse/MESOS-3011
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation, project website
>Reporter: Paul Brett
>Assignee: Tim Anderegg
>  Labels: documentation, mesosphere
>
> Currently, the website only provides a single version of the documentation.  
> We should publish documentation for each release on the website independently 
> (for example as https://mesos.apache.org/documentation/0.22/index.html, 
> https://mesos.apache.org/documentation/0.23/index.html) and make latest 
> redirect to the current version.





[jira] [Commented] (MESOS-3011) Publish release documentation for major releases on website

2016-09-19 Thread Tim Anderegg (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504544#comment-15504544
 ] 

Tim Anderegg commented on MESOS-3011:
-

[~vinodkone] Sorry this took so long. If you remember, you offered to shepherd 
this change a while back; here's the review: https://reviews.apache.org/r/52064/

Let me know if you have any questions!

> Publish release documentation for major releases on website
> ---
>
> Key: MESOS-3011
> URL: https://issues.apache.org/jira/browse/MESOS-3011
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation, project website
>Reporter: Paul Brett
>Assignee: Tim Anderegg
>  Labels: documentation, mesosphere
>
> Currently, the website only provides a single version of the documentation.  
> We should publish documentation for each release on the website independently 
> (for example as https://mesos.apache.org/documentation/0.22/index.html, 
> https://mesos.apache.org/documentation/0.23/index.html) and make latest 
> redirect to the current version.





[jira] [Commented] (MESOS-6130) Make the disk usage isolator nesting-aware

2016-09-19 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504499#comment-15504499
 ] 

Greg Mann commented on MESOS-6130:
--

Jie made these changes to the disk/du isolator:
{code}
Commit: a4fd86bce2d2b6b53bce2ea95e3b69c5ab011ff8
Parents: 8221021
Author: Jie Yu 
Authored: Sat Sep 17 12:26:52 2016 -0700
Committer: Jie Yu 
Committed: Sat Sep 17 13:02:37 2016 -0700
{code}

Note that we still need to add tests for this, but we'll wait briefly for the 
rest of the containerizer changes to land before doing so.

> Make the disk usage isolator nesting-aware
> --
>
> Key: MESOS-6130
> URL: https://issues.apache.org/jira/browse/MESOS-6130
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
>
> With the addition of task groups, the disk usage isolator must be updated. 
> Since sub-container sandboxes are nested within the parent container's 
> sandbox, the isolator must exclude these folders from its usage calculation 
> when examining the parent container's disk usage.
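
For illustration, a minimal sketch of that exclusion (hypothetical helper and path handling; the real isolator's {{du}} invocation may differ):

{code}
#include <string>
#include <vector>

// Hypothetical sketch: build a GNU du command that sums the parent
// sandbox while excluding each nested child sandbox from the result.
std::vector<std::string> duCommand(
    const std::string& sandbox,
    const std::vector<std::string>& childSandboxes)
{
  std::vector<std::string> argv = {"du", "-k", "-s", sandbox};

  for (const std::string& child : childSandboxes) {
    argv.push_back("--exclude=" + child);
  }

  return argv;
}
{code}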





[jira] [Commented] (MESOS-5856) Logrotate ContainerLogger module does not rotate logs when run as root with --switch_user

2016-09-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504438#comment-15504438
 ] 

Joseph Wu commented on MESOS-5856:
--

Added a unit test to assert the expected behavior: 
https://reviews.apache.org/r/52059/

> Logrotate ContainerLogger module does not rotate logs when run as root with 
> --switch_user
> -
>
> Key: MESOS-5856
> URL: https://issues.apache.org/jira/browse/MESOS-5856
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.27.0, 0.28.0, 1.0.0
>Reporter: Joseph Wu
>Assignee: Sivaram Kannan
>Priority: Critical
>  Labels: logger, mesosphere, newbie
>
> The logrotate ContainerLogger module runs as the agent's user.  In most 
> cases, this is {{root}}.
> When {{logrotate}} is run as root, there is an additional check the 
> configuration files must pass (because a root {{logrotate}} needs to be 
> secured against non-root modifications to the configuration):
> https://github.com/logrotate/logrotate/blob/fe80cb51a2571ca35b1a7c8ba0695db5a68feaba/config.c#L807-L815
> Log rotation will fail under the following scenario:
> 1) The agent is run with {{--switch_user}} (default: true)
> 2) A task is launched with a non-root user specified
> 3) The logrotate module spawns a few companion processes (as root) and this 
> creates the {{stdout}}, {{stderr}}, {{stdout.logrotate.conf}}, and 
> {{stderr.logrotate.conf}} files (as root).  This step races with the next 
> step.
> 4) The Mesos containerizer and Fetcher will {{chown}} the task's sandbox to 
> the non-root user.  Including the files just created.
> 5) When {{logrotate}} is run, it will skip any non-root configuration files.  
> This means the files are not rotated.
> 
> Fix: The logrotate module's companion processes should call {{setuid}} and 
> {{setgid}}.
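
For illustration, a minimal sketch of that fix (a hypothetical standalone helper; looking up the task user's uid/gid is left to the caller, e.g. via getpwnam):

{code}
#include <sys/types.h>
#include <unistd.h>

// Hypothetical sketch: drop the companion process's privileges to the
// task's user so its logrotate config files pass logrotate's
// root-ownership check. Returns 0 on success, -1 on failure. The
// group must be set first, while the process is still root.
int dropPrivileges(uid_t uid, gid_t gid)
{
  if (::setgid(gid) != 0) {
    return -1;
  }

  if (::setuid(uid) != 0) {
    return -1;
  }

  return 0;
}
{code}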





[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems

2016-09-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504390#comment-15504390
 ] 

Joseph Wu commented on MESOS-5909:
--

Go ahead. 

All reviewers are specified by the person creating the ReviewBoard review.  No 
need to ask permission, but it's often good to ping the people you've added via 
JIRA/email/Slack.

> Stout "OsTest.User" test can fail on some systems
> -
>
> Key: MESOS-5909
> URL: https://issues.apache.org/jira/browse/MESOS-5909
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Gilbert Song
>  Labels: mesosphere
> Attachments: MESOS-5909-fix.diff
>
>
> The libc call {{getgrouplist}} doesn't return the {{gid}} list in sorted 
> order (in my case, it returns "471 100"), whereas {{id -G}} returns a sorted 
> list ("100 471" in my case), causing the validation inside the loop to fail.
> We should sort both lists before comparing the values.





[jira] [Commented] (MESOS-6002) The whiteout file cannot be removed correctly using aufs backend.

2016-09-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-6002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504388#comment-15504388
 ] 

Stéphane Cottin commented on MESOS-6002:


Same issue using overlayfs:

{code}
Failed to remove whiteout file 
'/mnt/mesos/provisioner/containers/001bbc00-e460-4c15-a445-e3dd44f3dd8c/backends/overlay/rootfses/acb9cba3-671d-41a8-ad73-9b160f3ca048/var/lib/apt/lists/partial/.wh..opq':
 No such file or directory
{code}

Can be reproduced with the official postgres, rabbitmq, and many other docker 
images, all of which delete the same folder in multiple RUN calls.

> The whiteout file cannot be removed correctly using aufs backend.
> -
>
> Key: MESOS-6002
> URL: https://issues.apache.org/jira/browse/MESOS-6002
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: Ubuntu 14, Ubuntu 12
> Or any os with aufs module
>Reporter: Gilbert Song
>  Labels: aufs, backend, containerizer
>
> The whiteout file is not removed correctly when using the aufs backend in 
> the unified containerizer. It can be verified by this unit test with the aufs 
> backend manually specified.
> {noformat}
> [20:11:24] :   [Step 10/10] [ RUN  ] 
> ProvisionerDockerPullerTest.ROOT_INTERNET_CURL_Whiteout
> [20:11:24]W:   [Step 10/10] I0805 20:11:24.986734 24295 cluster.cpp:155] 
> Creating default 'local' authorizer
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.001153 24295 leveldb.cpp:174] 
> Opened db in 14.308627ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003731 24295 leveldb.cpp:181] 
> Compacted db in 2.558329ms
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003749 24295 leveldb.cpp:196] 
> Created db iterator in 3086ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003754 24295 leveldb.cpp:202] 
> Seeked to beginning of db in 595ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003758 24295 leveldb.cpp:271] 
> Iterated through 0 keys in the db in 314ns
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.003769 24295 replica.cpp:776] 
> Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004086 24315 recover.cpp:451] 
> Starting replica recovery
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004251 24312 recover.cpp:477] 
> Replica is in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004546 24314 replica.cpp:673] 
> Replica in EMPTY status received a broadcasted recover request from 
> __req_res__(5640)@172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004607 24312 recover.cpp:197] 
> Received a recover response from a replica in EMPTY status
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004762 24313 recover.cpp:568] 
> Updating replica status to STARTING
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004776 24314 master.cpp:375] 
> Master 21665992-d47e-402f-a00c-6f8fab613019 (ip-172-30-2-105.mesosphere.io) 
> started on 172.30.2.105:36006
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004787 24314 master.cpp:377] Flags 
> at startup: --acls="" --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="true" 
> --authenticate_frameworks="true" --authenticate_http_frameworks="true" 
> --authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
> --authenticators="crammd5" --authorizers="local" 
> --credentials="/tmp/0z753P/credentials" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --http_framework_authenticators="basic" --initialize_driver_logging="true" 
> --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
> --max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
> --max_completed_tasks_per_framework="1000" --quiet="false" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
> --registry_strict="true" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/0z753P/master" --zk_session_timeout="10secs"
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004920 24314 master.cpp:427] 
> Master only allowing authenticated frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004930 24314 master.cpp:441] 
> Master only allowing authenticated agents to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004935 24314 master.cpp:454] 
> Master only allowing authenticated HTTP frameworks to register
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.004942 24314 credentials.hpp:37] 
> Loading credentials for authentication from '/tmp/0z753P/credentials'
> [20:11:25]W:   [Step 10/10] I0805 20:11:25.005018 24314 master.cpp:499] Using 
> default 'crammd5' aut

[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems

2016-09-19 Thread Mao Geng (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504307#comment-15504307
 ] 

Mao Geng commented on MESOS-5909:
-

Thanks [~kaysoky]! I created https://reviews.apache.org/r/52048/ and added 
[~gilbert] and [~karya] as reviewers. May I add you as a reviewer too? 

> Stout "OsTest.User" test can fail on some systems
> -
>
> Key: MESOS-5909
> URL: https://issues.apache.org/jira/browse/MESOS-5909
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Gilbert Song
>  Labels: mesosphere
> Attachments: MESOS-5909-fix.diff
>
>
> The libc call {{getgrouplist}} doesn't return the {{gid}} list in sorted 
> order (in my case, it returns "471 100"), whereas {{id -G}} returns a sorted 
> list ("100 471" in my case), causing the validation inside the loop to fail.
> We should sort both lists before comparing the values.





[jira] [Updated] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval

2016-09-19 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6205:
-
Description: 
I followed this 
[doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
 to set up a mesos cluster.

There are three VMs (ubuntu 12, centos 6.5, centos 7.2).
{code}
$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3
{code}
config on each machine:
{code}
$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
{code}

After starting zookeeper, mesos-master and mesos-slave on the three VMs, I can 
view the mesos webui (10.142.55.190:5050), but the agent count is 0.
After a little while, the mesos page shows an error:
{code}
Failed to connect to 10.142.55.190:5050!
Retrying in 16 seconds... 
{code}
(I found that zookeeper would elect a new leader in a short interval)


mesos-master cmd:
{code}
mesos-master --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="false" --authenticate_frameworks="false" 
--authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticators="crammd5" 
--authorizers="local" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --ip="10.142.55.190" 
--log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--port="5050" --quiet="false" --quorum="2" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/share/mesos/webui" 
--work_dir="/var/lib/mesos" 
--zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
{code}

mesos-slave cmd:
{code}
mesos-slave --appc_simple_discovery_uri_prefix="http://"; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
--docker="docker" --docker_kill_orphans="true" 
--docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
--docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="10.142.55.190" --hostname_lookup="true" 
--http_authenticators="basic" --http_command_executor="false" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--ip="10.142.55.190" --isolation="posix/cpu,posix/mem" --launcher="posix" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" 
--master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" 
--systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/var/lib/mesos"
{code}

When I run mesos-master from the command line, I get:

{code}
I0919 17:20:19.286264 17550 replica.cpp:673] Replica in VOTING status received 
a broadcasted recover request from (583)@10.142.55.202:5050
F0919 17:20:20.009371 17556 master.cpp:1536] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f9db78458dd  google::LogMessage::Fail()
@ 0x7f9db784771d  google::LogMessage::SendToLog()
@ 0x7f9db78454cc  google::LogMessage::Flush()
@ 0x7f9db7848019  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9db6e2dbbc  mesos::internal::master::fail()
@

[jira] [Updated] (MESOS-6127) Implement support for HTTP/2

2016-09-19 Thread Aaron Wood (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Wood updated MESOS-6127:
--
Description: 
HTTP/2 will allow us to take advantage of connection multiplexing, header 
compression, streams, server push, etc. Add support for communication over 
HTTP/2 between masters and agents, framework endpoints, etc.

Should we support HTTP/2 without TLS? The spec allows for this but most major 
browser vendors, libraries, and implementations aren't supporting it unless TLS 
is used. If we do require TLS, what can be done to reduce the performance hit 
of the TLS handshake? Might need to change more code to make sure that we are 
taking advantage of connection sharing so that we can (ideally) only ever have 
a one-time TLS handshake per shared connection.

Some ideas for libs:

https://nghttp2.org/documentation/package_README.html - Has encoders/decoders 
supporting HPACK https://nghttp2.org/documentation/tutorial-hpack.html
https://nghttp2.org/documentation/libnghttp2_asio.html - Currently marked as 
experimental by the nghttp2 docs

  was:
HTTP/2 will allow us to take advantage of connection multiplexing, header 
compression, streams, server push, etc. Add support for communication over 
HTTP/2 between masters and agents, framework endpoints, etc.

Should we support HTTP/2 without TLS? The spec allows for this but most major 
browser vendors, libraries, and implementations aren't supporting it unless TLS 
is used. If we do require TLS, what can be done to reduce the performance hit 
of the TLS handshake? Might need to change more code to make sure that we are 
taking advantage of connection sharing so that we can (ideally) only ever have 
a one-time TLS handshake per shared connection.


> Implement support for HTTP/2
> -
>
> Key: MESOS-6127
> URL: https://issues.apache.org/jira/browse/MESOS-6127
> Project: Mesos
>  Issue Type: Epic
>  Components: HTTP API, libprocess
>Reporter: Aaron Wood
>  Labels: performance
>
> HTTP/2 will allow us to take advantage of connection multiplexing, header 
> compression, streams, server push, etc. Add support for communication over 
> HTTP/2 between masters and agents, framework endpoints, etc.
> Should we support HTTP/2 without TLS? The spec allows for this but most major 
> browser vendors, libraries, and implementations aren't supporting it unless 
> TLS is used. If we do require TLS, what can be done to reduce the performance 
> hit of the TLS handshake? Might need to change more code to make sure that we 
> are taking advantage of connection sharing so that we can (ideally) only ever 
> have a one-time TLS handshake per shared connection.
> Some ideas for libs:
> https://nghttp2.org/documentation/package_README.html - Has encoders/decoders 
> supporting HPACK https://nghttp2.org/documentation/tutorial-hpack.html
> https://nghttp2.org/documentation/libnghttp2_asio.html - Currently marked as 
> experimental by the nghttp2 docs
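For a feel of the experimental asio wrapper mentioned above, a minimal server
sketch (adapted from the nghttp2 documentation's example; this is not Mesos
code, and cleartext HTTP/2 is used for brevity, sidestepping the TLS question
discussed above):
{code}
#include <iostream>
#include <string>

#include <nghttp2/asio_http2_server.h>

using namespace nghttp2::asio_http2;
using namespace nghttp2::asio_http2::server;

int main()
{
  boost::system::error_code ec;
  http2 server;

  // Answer every request on "/" over a single multiplexed HTTP/2 connection.
  server.handle("/", [](const request& req, const response& res) {
    res.write_head(200);
    res.end("hello, world\n");
  });

  if (server.listen_and_serve(ec, "localhost", "3000")) {
    std::cerr << "error: " << ec.message() << std::endl;
  }

  return 0;
}
{code}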



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-19 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504216#comment-15504216
 ] 

Anand Mazumdar commented on MESOS-6202:
---

We undertook the work to clean up orphaned docker containers correctly as part 
of MESOS-3573. Unfortunately, this modified behavior should have been part of 
the 1.0 {{CHANGELOG}} but somehow was missed. As [~haosd...@gmail.com] 
suggested, you can use the {{docker_kill_orphans}} flag to get the previous 
behavior.

We currently don't check whether the {{id}} is also a valid UUID, as you 
pointed out. That seems orthogonal to this issue though. [~h0tbird] Can you 
file a separate issue for that?

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers in my CoreOS system whose names start with 
> _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem but when I start the third one 
> _('mesos-agent')_ all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code which is marked to be removed after deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.
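For context, a paraphrased sketch of the linked check (not the verbatim
source; {{DOCKER_NAME_PREFIX}} matches the constant used by the docker
containerizer, while the helper name here is made up):
{code}
#include <string>

#include <stout/strings.hpp>

// Any running container whose name starts with "mesos-" is treated as
// Mesos-launched, and so becomes eligible for the orphan cleanup that
// kills the reporter's containers.
const std::string DOCKER_NAME_PREFIX = "mesos-";

bool looksMesosLaunched(const std::string& containerName)
{
  return strings::startsWith(containerName, DOCKER_NAME_PREFIX);
}
{code}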



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504163#comment-15504163
 ] 

Joseph Wu commented on MESOS-6202:
--

{quote}
I was previously running Mesos 0.28.2 without this problem.
{quote}
This code has been unchanged since 0.23, so you should be hitting the same 
problem regardless of what version you are using.

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers in my CoreOS system whose names start with 
> _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem but when I start the third one 
> _('mesos-agent')_ all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code which is marked to be removed after deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems

2016-09-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504128#comment-15504128
 ] 

Joseph Wu commented on MESOS-5909:
--

We prefer ReviewBoard for all C++ and non-trivial changes.

This is partly for reasons recommended by the Apache Foundation (they want to 
keep records of everything) and partly because the GitHub repo is a read-only 
mirror :)

> Stout "OsTest.User" test can fail on some systems
> -
>
> Key: MESOS-5909
> URL: https://issues.apache.org/jira/browse/MESOS-5909
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Gilbert Song
>  Labels: mesosphere
> Attachments: MESOS-5909-fix.diff
>
>
> Libc call {{getgrouplist}} doesn't return the {{gid}} list in a sorted manner 
> (in my case, it's returning "471 100") ... whereas {{id -G}} returns a sorted 
> list ("100 471" in my case), causing the validation inside the loop to fail.
> We should sort both lists before comparing the values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6205) mesos-master cannot find mesos-slave, and elects a new leader in a short interval

2016-09-19 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15504119#comment-15504119
 ] 

Joseph Wu commented on MESOS-6205:
--

There are two repeating log messages that tell you (indirectly) that something 
is wrong:
{code}
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received 
a broadcasted recover request from (14)@10.142.55.202:5050
{code}
This message means that you've started this master before, with the same work 
directory.  It has some sort of persistent state in its work directory.

This log message tells you that there are two masters you have *not* started 
before:
{code}
I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise 
request because 2 ignores received
{code}

The masters will refuse to start because fewer than a quorum of masters have 
the persistent state.  If the masters were to start anyway, you would risk 
data loss.  This is the expected behavior, as Mesos errs on the side of 
caution.  

I'm assuming you want a fresh cluster (no prior state); you can fix this by 
deleting the work directory of the master on the {{10.142.55.202}} node.  If 
none of the masters have any prior state, they will reach consensus.

> mesos-master cannot find mesos-slave, and elects a new leader in a short 
> interval
> --
>
> Key: MESOS-6205
> URL: https://issues.apache.org/jira/browse/MESOS-6205
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64
>Reporter: kasim
>
> I followed this 
> [doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
>  to set up a mesos cluster.
> There are three VMs (ubuntu 12, centos 6.5, centos 7.2).
> $ cat /etc/hosts
> 10.142.55.190 zk1
> 10.142.55.196 zk2
> 10.142.55.202 zk3
> config on each machine:
> $ cat /etc/mesos/zk
> zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
> 
> After starting zookeeper, mesos-master and mesos-slave on the three VMs, I 
> can view the mesos webui (10.142.55.190:5050), but the agent count is 0.
> After a little while, the mesos page shows an error:
> Failed to connect to 10.142.55.190:5050!
> Retrying in 16 seconds... 
> (I found that zookeeper would elect a new leader in a short interval)
> 
> mesos-master cmd:
> ```
> mesos-master --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="false" 
> --authenticate_frameworks="false" --authenticate_http_frameworks="false" 
> --authenticate_http_readonly="false" --authenticate_http_readwrite="false" 
> --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --ip="10.142.55.190" 
> --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --port="5050" --quiet="false" --quorum="2" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/share/mesos/webui" 
> --work_dir="/var/lib/mesos" 
> --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
> ```
> mesos-slave cmd:
> ```
> mesos-slave --appc_simple_discovery_uri_prefix="http://"; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="

[jira] [Comment Edited] (MESOS-6180) Several tests are flaky, with futures timing out early

2016-09-19 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500170#comment-15500170
 ] 

haosdent edited comment on MESOS-6180 at 9/19/16 4:13 PM:
--

Tried to reproduce with {{stress}} on an AWS instance (16 CPUs, 30 GB memory, 
Ubuntu 14.04, c4.4xlarge), but could not reproduce after 5429 iterations 
either.


was (Author: haosd...@gmail.com):
Try to reproduce with stress in an aws instance (16 cpus, 32 gb mem, Ubuntu 
14.04), but could not reproduce after 5429 iterations as well.

> Several tests are flaky, with futures timing out early
> --
>
> Key: MESOS-6180
> URL: https://issues.apache.org/jira/browse/MESOS-6180
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Greg Mann
>Assignee: haosdent
>  Labels: mesosphere, tests
> Attachments: CGROUPS_ROOT_PidNamespaceBackward.log, 
> CGROUPS_ROOT_PidNamespaceForward.log, FetchAndStoreAndStoreAndFetch.log, 
> flaky-containerizer-pid-namespace-backward.txt, 
> flaky-containerizer-pid-namespace-forward.txt
>
>
> Following the merging of a large patch chain, it was noticed on our internal 
> CI that several tests had become flaky, with a similar pattern in the 
> failures: the tests fail early when a future times out. Often, this occurs 
> when a test cluster is being spun up and one of the offer futures times out. 
> This has been observed in the following tests:
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceForward
> * MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PidNamespaceBackward
> * ZooKeeperStateTest.FetchAndStoreAndStoreAndFetch
> * RoleTest.ImplicitRoleRegister
> * SlaveRecoveryTest/0.MultipleFrameworks
> * SlaveRecoveryTest/0.ReconcileShutdownFramework
> * SlaveTest.ContainerizerUsageFailure
> * MesosSchedulerDriverTest.ExplicitAcknowledgements
> * SlaveRecoveryTest/0.ReconnectHTTPExecutor (MESOS-6164)
> * ResourceOffersTest.ResourcesGetReofferedAfterTaskInfoError (MESOS-6165)
> * SlaveTest.CommandTaskWithKillPolicy (MESOS-6166)
> See the linked JIRAs noted above for individual tickets addressing a couple 
> of these.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-6159) Remove stout's Set type

2016-09-19 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-6159:
---

Assignee: Benjamin Bannier

> Remove stout's Set type
> ---
>
> Key: MESOS-6159
> URL: https://issues.apache.org/jira/browse/MESOS-6159
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Minor
>  Labels: tech-debt
>
> stout provides a {{Set}} type which wraps a {{std::set}}. Its only addition 
> is a set of extra constructors,
> {code}
> Set(const T& t1);
> Set(const T& t1, const T& t2);
> Set(const T& t1, const T& t2, const T& t3);
> Set(const T& t1, const T& t2, const T& t3, const T& t4);
> {code}
> which simplify creating a {{Set}} from (up to four) known elements.
> C++11 brought {{std::initializer_list}}, which can be used to create a 
> {{std::set}} from an arbitrary number of elements, so it should be possible 
> to retire {{Set}}.
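For illustration, the suggested replacement in plain C++11:
{code}
#include <set>

int main()
{
  // Before: stout's wrapper, limited to at most four elements.
  // Set<int> s(1, 2, 3);

  // After: brace initialization via std::initializer_list, which works
  // for any number of elements.
  std::set<int> s{1, 2, 3};

  return s.size() == 3 ? 0 : 1;
}
{code}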



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6209) Containers that use the Mesos containerizer but don't want to provision a container image fail to validate.

2016-09-19 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-6209:
---

 Summary: Containers that use the Mesos containerizer but don't 
want to provision a container image fail to validate.
 Key: MESOS-6209
 URL: https://issues.apache.org/jira/browse/MESOS-6209
 Project: Mesos
  Issue Type: Bug
  Components: containerization
 Environment: Mesos HEAD, change was introduced with 
e65f580bf0cbea64cedf521cf169b9b4c9f85454
Reporter: Jan Schlicht


Tasks using features like volumes or CNI in their containers have to define 
these in {{TaskInfo.container}}. When these tasks don't want/need to provision 
a container image, neither {{ContainerInfo.docker}} nor {{ContainerInfo.mesos}} 
will be set. Nevertheless, the container type in {{ContainerInfo.type}} needs 
to be set, because it is a required field.
In that case, the recently introduced validation rules in 
{{master/validation.cpp}} ({{validateContainerInfo}}) will fail, which isn't 
expected.
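For example, a framework-side sketch of the affected case (a hypothetical
snippet against the public protobufs, not taken from the ticket):
{code}
#include <mesos/mesos.hpp>

// A task that wants a volume but no container image: `type` must still be
// set because it is a required field, yet neither `docker` nor `mesos` is
// filled in -- exactly the case the new validation rejects.
mesos::TaskInfo taskInfo;

mesos::ContainerInfo* container = taskInfo.mutable_container();
container->set_type(mesos::ContainerInfo::MESOS);

mesos::Volume* volume = container->add_volumes();
volume->set_container_path("data");
volume->set_host_path("/var/data");
volume->set_mode(mesos::Volume::RW);
{code}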



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6208) Containers that use the Mesos containerizer but don't want to provision a container image fail to validate.

2016-09-19 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-6208:
---

 Summary: Containers that use the Mesos containerizer but don't 
want to provision a container image fail to validate.
 Key: MESOS-6208
 URL: https://issues.apache.org/jira/browse/MESOS-6208
 Project: Mesos
  Issue Type: Bug
  Components: containerization
 Environment: Mesos HEAD, change was introduced with 
e65f580bf0cbea64cedf521cf169b9b4c9f85454
Reporter: Jan Schlicht


Tasks using features like volumes or CNI in their containers have to define 
these in {{TaskInfo.container}}. When these tasks don't want/need to provision 
a container image, neither {{ContainerInfo.docker}} nor {{ContainerInfo.mesos}} 
will be set. Nevertheless, the container type in {{ContainerInfo.type}} needs 
to be set, because it is a required field.
In that case, the recently introduced validation rules in 
{{master/validation.cpp}} ({{validateContainerInfo}}) will fail, which isn't 
expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-09-19 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503630#comment-15503630
 ] 

Ilya Pronin edited comment on MESOS-6207 at 9/19/16 2:23 PM:
-

The proposed solution is to move the Python-related configuration steps down 
after all the others. If that's OK, I'd like to post a patch once my 
{{contributors.yaml}} PR is merged.


was (Author: ipronin):
Proposed solution is to move Python related configuration steps down after all 
others. If that's OK I'd like to post a patch when my {{contributors.yaml}} PR 
will get merged.

> Python bindings fail to build with custom SVN installation path
> ---
>
> Key: MESOS-6207
> URL: https://issues.apache.org/jira/browse/MESOS-6207
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.1
>Reporter: Ilya Pronin
>Priority: Trivial
>
> In {{src/Makefile.am}} the {{PYTHON_LDFLAGS}} variable is used while building 
> the Python bindings. This variable picks up {{LDFLAGS}} during the 
> configuration phase, before we check for a custom SVN installation path, and 
> so misses the {{-L$\{with_svn\}/lib}} flag. That causes a link error on 
> systems with an uncommon SVN installation path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-09-19 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15503630#comment-15503630
 ] 

Ilya Pronin commented on MESOS-6207:


The proposed solution is to move the Python-related configuration steps down 
after all the others. If that's OK, I'd like to post a patch once my 
{{contributors.yaml}} PR is merged.

> Python bindings fail to build with custom SVN installation path
> ---
>
> Key: MESOS-6207
> URL: https://issues.apache.org/jira/browse/MESOS-6207
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.1
>Reporter: Ilya Pronin
>Priority: Trivial
>
> In {{src/Makefile.am}} the {{PYTHON_LDFLAGS}} variable is used while building 
> the Python bindings. This variable picks up {{LDFLAGS}} during the 
> configuration phase, before we check for a custom SVN installation path, and 
> so misses the {{-L$\{with_svn\}/lib}} flag. That causes a link error on 
> systems with an uncommon SVN installation path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5275) Add capabilities support for unified containerizer.

2016-09-19 Thread Benjamin Bannier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15386728#comment-15386728
 ] 

Benjamin Bannier edited comment on MESOS-5275 at 9/19/16 2:21 PM:
--

Reviews:
https://reviews.apache.org/r/50271
https://reviews.apache.org/r/51930
https://reviews.apache.org/r/51931


was (Author: bbannier):
Review: https://reviews.apache.org/r/50271/

> Add capabilities support for unified containerizer.
> ---
>
> Key: MESOS-5275
> URL: https://issues.apache.org/jira/browse/MESOS-5275
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Jojy Varghese
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> Add capabilities support for unified containerizer. 
> Requirements:
> 1. Use the mesos capabilities API.
> 2. Frameworks should be able to add capability requests for containers.
> 3. Agents should be able to set the maximum allowed capabilities for all 
> launched containers.
> Design document: 
> https://docs.google.com/document/d/1YiTift8TQla2vq3upQr7K-riQ_pQ-FKOCOsysQJROGc/edit#heading=h.rgfwelqrskmd



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6207) Python bindings fail to build with custom SVN installation path

2016-09-19 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-6207:
--

 Summary: Python bindings fail to build with custom SVN 
installation path
 Key: MESOS-6207
 URL: https://issues.apache.org/jira/browse/MESOS-6207
 Project: Mesos
  Issue Type: Bug
  Components: build
Affects Versions: 1.0.1
Reporter: Ilya Pronin
Priority: Trivial


In {{src/Makefile.am}} the {{PYTHON_LDFLAGS}} variable is used while building 
the Python bindings. This variable picks up {{LDFLAGS}} during the 
configuration phase, before we check for a custom SVN installation path, and so 
misses the {{-L$\{with_svn\}/lib}} flag. That causes a link error on systems 
with an uncommon SVN installation path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5320) SSL related error messages can be misguiding or incomplete

2016-09-19 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff reassigned MESOS-5320:
-

Assignee: Till Toenshoff

> SSL related error messages can be misguiding or incomplete
> --
>
> Key: MESOS-5320
> URL: https://issues.apache.org/jira/browse/MESOS-5320
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>  Labels: ssl
>
> I was trying to activate SSL within Mesos but had generated an invalid 
> certificate; it was signed with a mismatching key. Once I started the master, 
> the error message I received was rather confusing to me:
> {noformat}
> W0503 10:15:58.027343  6696 openssl.cpp:363] Failed SSL connections will be 
> downgraded to a non-SSL socket
> Could not load key file
> {noformat} 
> To me, this error message hinted that the key file did not exist or had 
> permission issues. However, a quick {{strace}} revealed that the key file was 
> properly accessed; there was no sign of a file-not-found or similar error.
> The problem here is the hardcoded error message, which does not take 
> OpenSSL's human-readable error strings into account.
> The code that misguided me is located at  
> https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/openssl.cpp#L471
> We might want to change
> {noformat}
>   // Set private key.
>   if (SSL_CTX_use_PrivateKey_file(
>   ctx,
>   ssl_flags->key_file.get().c_str(),
>   SSL_FILETYPE_PEM) != 1) {
> EXIT(EXIT_FAILURE) << "Could not load key file";
>   }
> {noformat}
> Towards something like this
> {noformat}
>   // Set private key.
>   if (SSL_CTX_use_PrivateKey_file(
>   ctx,
>   ssl_flags->key_file.get().c_str(),
>   SSL_FILETYPE_PEM) != 1) {
> EXIT(EXIT_FAILURE) << "Could not use key file: " << 
> ERR_error_string(ERR_get_error(), NULL);
>   }
> {noformat}
> To receive a much more helpful message like this
> {noformat}
> W0503 13:18:12.551364 11572 openssl.cpp:363] Failed SSL connections will be 
> downgraded to a non-SSL socket
> Could not use key file: error:0B080074:x509 certificate 
> routines:X509_check_private_key:key values mismatch
> {noformat}
> A quick scan of the implementation within {{openssl.cpp}} suggests to me 
> that there are more places we might want to update with more descriptive 
> error messages. 
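As one more instance of the same pattern, a hypothetical tweak to the
certificate-loading call (sketch only; the exact call sites in
{{openssl.cpp}} may differ):
{noformat}
  // Set certificate chain, surfacing OpenSSL's own error string on failure.
  if (SSL_CTX_use_certificate_chain_file(
      ctx,
      ssl_flags->cert_file.get().c_str()) != 1) {
    EXIT(EXIT_FAILURE) << "Could not use certificate file: "
                       << ERR_error_string(ERR_get_error(), nullptr);
  }
{noformat}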



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5952) Update docs for new PARTITION_AWARE behavior

2016-09-19 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5952:
---
Summary: Update docs for new PARTITION_AWARE behavior  (was: Update docs 
for new slave removal behavior)

> Update docs for new PARTITION_AWARE behavior
> 
>
> Key: MESOS-5952
> URL: https://issues.apache.org/jira/browse/MESOS-5952
> Project: Mesos
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Neil Conway
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6206) Change reconciliation to return results for in-progress removals and reregistrations

2016-09-19 Thread Neil Conway (JIRA)
Neil Conway created MESOS-6206:
--

 Summary: Change reconciliation to return results for in-progress 
removals and reregistrations
 Key: MESOS-6206
 URL: https://issues.apache.org/jira/browse/MESOS-6206
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Neil Conway
Assignee: Neil Conway


The master does not return any reconciliation results for agents it views as 
"transitioning". An agent is defined as transitioning if any of the following 
are true:

1. The master recovered from the registry after failover but the agent has not 
yet reregistered
2. The master is in the process of removing an admitted agent from the registry
3. The master is in the process of re-registering an agent (i.e., re-adding it 
to the list of admitted agents).

I think case #1 makes sense but cases #2 and #3 do not. Before the registry 
operation completes, we should instead view the slave as still being in its 
previous state ("admitted" for case 2 and not-admitted/unreachable/etc. for 
case 3).

Reasons to make this change:
1. Improve consistency with output of endpoints, etc.: until the registry 
operation to remove/re-admit a slave finishes, we show the previous state of 
the slave in the HTTP endpoints. Returning reconciliation results that are 
consistent with HTTP endpoint values is sensible.
2. It is simpler. Rather than not sending anything to frameworks and requiring 
that they ask us again later, it is simpler to just send the current state of 
the agent. If that state changes (whether due to the registry operation 
succeeding or a subsequent state change), then the reconciliation results might 
be stale -- so be it. Such stale information fundamentally cannot be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-19 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502997#comment-15502997
 ] 

haosdent commented on MESOS-6202:
-

This requires that we first update {{UUID::fromString}} to:
{code}
-  static UUID fromString(const std::string& s)
+  static Try<UUID> fromString(const std::string& s)
   {
-    // NOTE: We don't use THREAD_LOCAL for the `string_generator`
-    // (unlike for the `random_generator` above), because it is cheap
-    // to construct one each time.
-    boost::uuids::string_generator gen;
-    boost::uuids::uuid uuid = gen(s);
-    return UUID(uuid);
+    try {
+      // NOTE: We don't use THREAD_LOCAL for the `string_generator`
+      // (unlike for the `random_generator` above), because it is cheap
+      // to construct one each time.
+      boost::uuids::string_generator gen;
+      boost::uuids::uuid uuid = gen(s);
+      return UUID(uuid);
+    } catch (const std::exception& e) {
+      return Error("Invalid UUID '" + s + "': " + e.what());
+    } catch (...) {
+      return Error("Invalid UUID '" + s + "': unknown exception.");
+    }
   }
{code}
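And a hypothetical call site in the docker containerizer, sketching how
callers would change once {{fromString}} returns a {{Try<UUID>}}
({{container.name}}, {{DOCKER_NAME_PREFIX}}, and the surrounding loop are
assumed from the context of this ticket):
{code}
// Instead of throwing (and previously killing the container), an
// unparsable suffix now just means "not a Mesos container": skip it.
Try<UUID> uuid = UUID::fromString(strings::remove(
    container.name, DOCKER_NAME_PREFIX, strings::PREFIX));

if (uuid.isError()) {
  continue;  // Name matched the prefix but carries no valid UUID.
}
{code}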

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers in my CoreOS system whose names start with 
> _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem but when I start the third one 
> _('mesos-agent')_ all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code which is marked to be removed after deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6205) mesos-master cannot find mesos-slave, and elects a new leader in a short interval

2016-09-19 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502887#comment-15502887
 ] 

haosdent commented on MESOS-6205:
-

Hi [~mithril], you need to make sure every master can talk to the others. 
According to your logs, your masters still have not finished leader election. 

> mesos-master cannot find mesos-slave, and elects a new leader in a short 
> interval
> --
>
> Key: MESOS-6205
> URL: https://issues.apache.org/jira/browse/MESOS-6205
> Project: Mesos
>  Issue Type: Bug
>  Components: master
> Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64
>Reporter: kasim
>
> I followed this 
> [doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
>  to set up a mesos cluster.
> There are three VMs (ubuntu 12, centos 6.5, centos 7.2).
> $ cat /etc/hosts
> 10.142.55.190 zk1
> 10.142.55.196 zk2
> 10.142.55.202 zk3
> config on each machine:
> $ cat /etc/mesos/zk
> zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
> 
> After starting zookeeper, mesos-master and mesos-slave on the three VMs, I 
> can view the mesos webui (10.142.55.190:5050), but the agent count is 0.
> After a little while, the mesos page shows an error:
> Failed to connect to 10.142.55.190:5050!
> Retrying in 16 seconds... 
> (I found that zookeeper would elect a new leader in a short interval)
> 
> mesos-master cmd:
> ```
> mesos-master --agent_ping_timeout="15secs" 
> --agent_reregister_timeout="10mins" --allocation_interval="1secs" 
> --allocator="HierarchicalDRF" --authenticate_agents="false" 
> --authenticate_frameworks="false" --authenticate_http_frameworks="false" 
> --authenticate_http_readonly="false" --authenticate_http_readwrite="false" 
> --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" 
> --help="false" --hostname_lookup="true" --http_authenticators="basic" 
> --initialize_driver_logging="true" --ip="10.142.55.190" 
> --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
> --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --port="5050" --quiet="false" --quorum="2" 
> --recovery_agent_removal_limit="100%" --registry="replicated_log" 
> --registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/share/mesos/webui" 
> --work_dir="/var/lib/mesos" 
> --zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
> ```
> mesos-slave cmd:
> ```
> mesos-slave --appc_simple_discovery_uri_prefix="http://"; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname="10.142.55.190" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_provisioner_backend="copy" 
> --initialize_driver_logging="true" --ip="10.142.55.190" 
> --isolation="posix/cpu,posix/mem" --launcher="posix" 
> --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" 
> --logbufsecs="0" --logging_level="INFO" 
> --master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
>  --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
> --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" 
> --systemd_enable_support="true" 
> --systemd_runtim

[jira] [Updated] (MESOS-6205) mesos-master cannot find mesos-slave, and elects a new leader in a short interval

2016-09-19 Thread kasim (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kasim updated MESOS-6205:
-
Description: 
I followed this 
[doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
 to set up a mesos cluster.

There are three VMs (ubuntu 12, centos 6.5, centos 7.2).

$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3

config on each machine:

$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos


After starting zookeeper, mesos-master and mesos-slave on the three VMs, I can 
view the mesos webui (10.142.55.190:5050), but the agent count is 0.
After a little while, the mesos page shows an error:

Failed to connect to 10.142.55.190:5050!
Retrying in 16 seconds... 
(I found that zookeeper would elect a new leader in a short interval)


mesos-master cmd:
```
mesos-master --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="false" --authenticate_frameworks="false" 
--authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticators="crammd5" 
--authorizers="local" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --ip="10.142.55.190" 
--log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--port="5050" --quiet="false" --quorum="2" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/share/mesos/webui" 
--work_dir="/var/lib/mesos" 
--zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
```


mesos-slave cmd:
```
mesos-slave --appc_simple_discovery_uri_prefix="http://"; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
--docker="docker" --docker_kill_orphans="true" 
--docker_registry="https://registry-1.docker.io"; --docker_remove_delay="6hrs" 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
--docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="10.142.55.190" --hostname_lookup="true" 
--http_authenticators="basic" --http_command_executor="false" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--ip="10.142.55.190" --isolation="posix/cpu,posix/mem" --launcher="posix" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" 
--master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" 
--systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/var/lib/mesos"
```

When I run mesos-master from the command line, I get:

```
I0919 17:20:19.286264 17550 replica.cpp:673] Replica in VOTING status received 
a broadcasted recover request from (583)@10.142.55.202:5050
F0919 17:20:20.009371 17556 master.cpp:1536] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f9db78458dd  google::LogMessage::Fail()
@ 0x7f9db784771d  google::LogMessage::SendToLog()
@ 0x7f9db78454cc  google::LogMessage::Flush()
@ 0x7f9db7848019  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9db6e2dbbc  mesos::internal::master::fail()
@ 0x7f9db6e75b20  
_ZNSt17_Function_handlerIFvRKSsEZNK7proce

[jira] [Updated] (MESOS-6205) mesos-master cannot find mesos-slave, and elects a new leader in a short interval

2016-09-19 Thread kasim (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kasim updated MESOS-6205:
-
Description: 
I followed this 
[doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
 to set up a mesos cluster.

There are three VMs (ubuntu 12, centos 6.5, centos 7.2).

$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3

config on each machine:

$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos

After starting zookeeper, mesos-master and mesos-slave on the three VMs, I can 
view the mesos webui (10.142.55.190:5050), but the agent count is 0.

After a little while, the mesos page shows an error:

Failed to connect to 10.142.55.190:5050!
Retrying in 16 seconds... 
(I found that zookeeper would elect a new leader in a short interval)

mesos-master cmd:
```
mesos-master --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="false" --authenticate_frameworks="false" 
--authenticate_http_frameworks="false" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticators="crammd5" 
--authorizers="local" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--initialize_driver_logging="true" --ip="10.142.55.190" 
--log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--port="5050" --quiet="false" --quorum="2" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="20secs" 
--registry_strict="false" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/share/mesos/webui" 
--work_dir="/var/lib/mesos" 
--zk="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos"
```


mesos-slave command:
```
mesos-slave --appc_simple_discovery_uri_prefix="http://" 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
--docker="docker" --docker_kill_orphans="true" 
--docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
--docker_store_dir="/tmp/mesos/store/docker" 
--docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="10.142.55.190" --hostname_lookup="true" 
--http_authenticators="basic" --http_command_executor="false" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--ip="10.142.55.190" --isolation="posix/cpu,posix/mem" --launcher="posix" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" 
--master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" 
--systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/var/lib/mesos"
```
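
To check whether any agent actually registered with the leading master, one 
simple probe is the master's standard /master/slaves endpoint (a sketch; any 
of the three masters should answer or redirect):

```
# Lists registered agents as JSON; an empty "slaves" array together with
# the agent command above points at master instability, not agent config.
curl -s http://10.142.55.190:5050/master/slaves
```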

When I run mesos-master from the command line, I get:

```
I0919 17:20:19.286264 17550 replica.cpp:673] Replica in VOTING status received 
a broadcasted recover request from (583)@10.142.55.202:5050
F0919 17:20:20.009371 17556 master.cpp:1536] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f9db78458dd  google::LogMessage::Fail()
@ 0x7f9db784771d  google::LogMessage::SendToLog()
@ 0x7f9db78454cc  google::LogMessage::Flush()
@ 0x7f9db7848019  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9db6e2dbbc  mesos::internal::master::fail()
@ 0x7f9db6e75b20  
_ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderI
```
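
One hedged way to watch recovery from outside the logs: master/elected and 
registrar/log/recovered are standard master metrics, so polling each node's 
/metrics/snapshot shows whether any master both wins the election and finishes 
registrar recovery:

```
# Poll the standard metrics endpoint on every master.
for h in 10.142.55.190 10.142.55.196 10.142.55.202; do
  echo "== $h =="
  curl -s "http://$h:5050/metrics/snapshot" \
    | grep -Eo '"(master/elected|registrar/log/recovered)":[0-9.]+'
done
```

If master/elected flaps between hosts while registrar/log/recovered stays 0, 
that matches the symptom reported here: each new leader aborts on the registrar 
fetch timeout and triggers another election.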

[jira] [Updated] (MESOS-6110) Deprecate using health checks without setting the type

2016-09-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6110:
---
Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43  (was: Mesosphere Sprint 
42)

> Deprecate using health checks without setting the type
> --
>
> Key: MESOS-6110
> URL: https://issues.apache.org/jira/browse/MESOS-6110
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.1.0
>Reporter: Silas Snider
>Assignee: haosdent
>Priority: Blocker
>  Labels: compatibility, health-check, mesosphere
>
> When sending a task launch using the 1.0.x protos and the legacy (non-http) 
> API, tasks with a healthcheck defined are rejected (TASK_ERROR) because the 
> 'type' field is not set.
> This field is marked optional in the proto and is not available before 1.1.0, 
> so it should not be required in order to keep the mesos v1 api compatibility 
> promise.
> For backward compatibility, temporarily allow the case where a command 
> health check is set without a type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5987) Update health check protobuf for HTTP and TCP health check

2016-09-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5987:
---
Sprint: Mesosphere Sprint 40, Mesosphere Sprint 42, Mesosphere Sprint 43  
(was: Mesosphere Sprint 40, Mesosphere Sprint 42)

> Update health check protobuf for HTTP and TCP health check
> --
>
> Key: MESOS-5987
> URL: https://issues.apache.org/jira/browse/MESOS-5987
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To support HTTP and TCP health checks, we need to update the existing 
> {{HealthCheck}} protobuf message according to what [~alexr] and [~gaston] 
> commented in https://reviews.apache.org/r/36816/ and 
> https://reviews.apache.org/r/49360/.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6119) TCP health checks are not portable.

2016-09-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-6119:
---
Sprint: Mesosphere Sprint 42, Mesosphere Sprint 43  (was: Mesosphere Sprint 
42)

> TCP health checks are not portable.
> ---
>
> Key: MESOS-6119
> URL: https://issues.apache.org/jira/browse/MESOS-6119
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Assignee: Alexander Rukletsov
>  Labels: health-check, mesosphere
>
> MESOS-3567 introduced a dependency on "bash" for TCP health checks, which is 
> undesirable. We should implement a portable solution for TCP health checks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-19 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15502699#comment-15502699
 ] 

Marc Villacorta commented on MESOS-6202:


Would you consider adding a validation to make sure {{id}} is a valid Docker 
UUID?

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 Docker containers on my CoreOS system whose names start with 
> _'mesos-'_; those are _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem, but when I start the third 
> one _('mesos-agent')_, all three containers are killed by the Docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code, which is marked to be removed after a deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.
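
Until that deprecation cycle completes, one hedged workaround (besides renaming 
the containers, as above) may be to disable orphan-container cleanup on the 
agent; --docker_kill_orphans is a real agent flag (it appears in the 
mesos-slave command earlier in this digest) that gates the startup sweep. The 
trade-off is that genuinely orphaned executor containers are then no longer 
reaped automatically. A sketch with an illustrative master URL:

```
# Sketch: --docker_kill_orphans=false skips the orphan sweep that kills
# "mesos-"-prefixed containers; the zk:// URL here is a placeholder.
mesos-slave --master=zk://zk1:2181/mesos \
  --work_dir=/var/lib/mesos \
  --containerizers=docker,mesos \
  --docker_kill_orphans=false
```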



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval

2016-09-19 Thread kasim (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kasim updated MESOS-6205:
-
Description: 
I followed this 
[doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
 to set up the Mesos cluster.

There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2).

$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3

Config on each machine:

$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos

After starting ZooKeeper, mesos-master, and mesos-slave on the three VMs, I can 
view the Mesos web UI (10.142.55.190:5050), but the agent count is 0.

After a short time, the Mesos page shows an error:

Failed to connect to 10.142.55.190:5050!
Retrying in 16 seconds... 
(I found that ZooKeeper elects a new leader at short intervals.)


master info log:

I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for 
/master/state?jsonp=angular.callbacks._1x to the leading master zk3
I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (768)@10.142.55.202:5050
I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (185)@10.142.55.196:5050
I0919 15:55:00.79 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (771)@10.142.55.202:5050
I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (188)@10.142.55.196:5050
I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (774)@10.142.55.202:5050
I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (191)@10.142.55.196:5050
I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (777)@10.142.55.202:5050
I0919 15:55:03.205920 13285 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (780)@10.142.55.202:5050
I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (194)@10.142.55.196:5050
I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (783)@10.142.55.202:5050
I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (197)@10.142.55.196:5050
I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (200)@10.142.55.196:5050
I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (786)@10.142.55.202:5050
I0919 15:55:05.184682 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (789)@10.142.55.202:5050
I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (203)@10.142.55.196:5050
I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (206)@10.142.55.196:5050
I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (792)@10.142.55.202:5050
I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (209)@10.142.55.196:5050
I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (5)@10.142.55.202:5050
I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships 
changed
I0919 15:55:07.393427 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000709' in ZooKeeper
I0919 15:55:07.393985 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000711' in ZooKeeper
I0919 15:55:07.394394 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000714' in ZooKeeper
I0919 15:55:07.394843 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000715' in ZooKeeper
I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { 
log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, 
log-replica(1)@10.142.55.202:5050 }
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (14)@10.142.55.202:5050
I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (21)@10.142.55.202:5050
I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (24)@10.142.55.202:5050
I0919 15:55:09.742185 13287 http.cpp:381] HTTP GET for /m
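
The loop above, in which replicas already in VOTING status keep receiving 
broadcast recover requests, is the registrar timeout seen from the replica 
side: some master keeps trying to recover and never assembles a quorum. A 
hedged first check is pairwise reachability of the masters' port 5050 (the 
log-replica actors live inside the master process), run from each of the three 
hosts:

```
for h in 10.142.55.190 10.142.55.196 10.142.55.202; do
  # -z: just probe the port; -w 2: two-second timeout.
  nc -zv -w 2 "$h" 5050
done
```

If any pair cannot connect, the replicated log cannot reach quorum even though 
ZooKeeper elections still succeed, which would produce exactly this 
flapping-leader behavior.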

[jira] [Updated] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval

2016-09-19 Thread kasim (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kasim updated MESOS-6205:
-
Description: 
I followed this 
[doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
 to set up the Mesos cluster.

There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2).

$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3

Config on each machine:

$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos

After I start ZooKeeper, mesos-master, and mesos-slave on the three VMs, I can 
view the Mesos web UI (10.142.55.190:5050), but the agent count is 0.

After a short time, the Mesos page shows an error:

Failed to connect to 10.142.55.190:5050!
Retrying in 16 seconds... 
(I found that ZooKeeper elects a new leader at short intervals.)


master info log:

I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for 
/master/state?jsonp=angular.callbacks._1x to the leading master zk3
I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (768)@10.142.55.202:5050
I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (185)@10.142.55.196:5050
I0919 15:55:00.79 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (771)@10.142.55.202:5050
I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (188)@10.142.55.196:5050
I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (774)@10.142.55.202:5050
I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (191)@10.142.55.196:5050
I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (777)@10.142.55.202:5050
I0919 15:55:03.205920 13285 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (780)@10.142.55.202:5050
I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (194)@10.142.55.196:5050
I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (783)@10.142.55.202:5050
I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (197)@10.142.55.196:5050
I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (200)@10.142.55.196:5050
I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (786)@10.142.55.202:5050
I0919 15:55:05.184682 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (789)@10.142.55.202:5050
I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (203)@10.142.55.196:5050
I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (206)@10.142.55.196:5050
I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (792)@10.142.55.202:5050
I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (209)@10.142.55.196:5050
I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (5)@10.142.55.202:5050
I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships 
changed
I0919 15:55:07.393427 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000709' in ZooKeeper
I0919 15:55:07.393985 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000711' in ZooKeeper
I0919 15:55:07.394394 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000714' in ZooKeeper
I0919 15:55:07.394843 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000715' in ZooKeeper
I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { 
log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, 
log-replica(1)@10.142.55.202:5050 }
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (14)@10.142.55.202:5050
I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (21)@10.142.55.202:5050
I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (24)@10.142.55.202:5050
I0919 15:55:09.742185 13287 http.cpp:381] HTTP GET 
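
If one master's replicated-log state is damaged, a hedged last-resort step 
(assuming the other two masters are healthy, so quorum=2 still holds without 
this node) is to move that node's log aside and let it rejoin through the 
recover protocol; the log lives under the master's --work_dir:

```
systemctl stop mesos-master            # or: service mesos-master stop
mv /var/lib/mesos/replicated_log /var/lib/mesos/replicated_log.bak
systemctl start mesos-master
```

This should only ever be done on one node at a time; wiping a quorum's worth 
of replicas loses the registry.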

[jira] [Created] (MESOS-6205) mesos-master can not found mesos-slave, and elect a new leader in a short interval

2016-09-19 Thread kasim (JIRA)
kasim created MESOS-6205:


 Summary: mesos-master can not found mesos-slave, and elect a new 
leader in a short interval
 Key: MESOS-6205
 URL: https://issues.apache.org/jira/browse/MESOS-6205
 Project: Mesos
  Issue Type: Bug
  Components: master
 Environment: ubuntu 12 x64, centos 6.5 x64, centos 7.2 x64
Reporter: kasim


I followed this 
[doc|https://open.mesosphere.com/getting-started/install/#verifying-installation]
 to set up the Mesos cluster.

There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2).

$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3

Config on each machine:

$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos

After I start mesos-master on the three VMs, I can view the Mesos web UI 
(10.142.55.190:5050), but after a short time the page shows an error:

Failed to connect to 10.142.55.190:5050!
Retrying in 16 seconds... 
(I found that ZooKeeper elects a new leader at short intervals.)


master info log:

I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for 
/master/state?jsonp=angular.callbacks._1x to the leading master zk3
I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (768)@10.142.55.202:5050
I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (185)@10.142.55.196:5050
I0919 15:55:00.79 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (771)@10.142.55.202:5050
I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (188)@10.142.55.196:5050
I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (774)@10.142.55.202:5050
I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (191)@10.142.55.196:5050
I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (777)@10.142.55.202:5050
I0919 15:55:03.205920 13285 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (780)@10.142.55.202:5050
I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (194)@10.142.55.196:5050
I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (783)@10.142.55.202:5050
I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (197)@10.142.55.196:5050
I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (200)@10.142.55.196:5050
I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (786)@10.142.55.202:5050
I0919 15:55:05.184682 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (789)@10.142.55.202:5050
I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (203)@10.142.55.196:5050
I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (206)@10.142.55.196:5050
I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (792)@10.142.55.202:5050
I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (209)@10.142.55.196:5050
I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (5)@10.142.55.202:5050
I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships 
changed
I0919 15:55:07.393427 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000709' in ZooKeeper
I0919 15:55:07.393985 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000711' in ZooKeeper
I0919 15:55:07.394394 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000714' in ZooKeeper
I0919 15:55:07.394843 13285 group.cpp:706] Trying to get 
'/mesos/log_replicas/000715' in ZooKeeper
I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { 
log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, 
log-replica(1)@10.142.55.202:5050 }
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (14)@10.142.55.202:5050
I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status 
received a broadcasted recover request from (21)@10.142.55.202:5050
I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status