[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread Yu Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756585#comment-15756585
 ] 

Yu Yang commented on MESOS-6810:


This is the output of {{curl -V}}:
{quote}
curl 7.47.0 (x86_64-pc-linux-gnu) libcurl/7.47.0 GnuTLS/3.4.10 zlib/1.2.8 
libidn/1.32 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 
pop3s rtmp rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL 
libz TLS-SRP UnixSockets 
{quote}

I should clarify that this problem was also reproduced on a Mesos cluster with 
nearly 30 machines; the error message is:
{{Failed to launch container: Collect failed: Failed to perform 'curl': curl: 
(56) SSL read: error::lib(0):func(0):reason(0), errno 104; Container 
destroyed while provisioning images}}

> Tasks getting stuck in STAGING state when using unified containerizer
> -
>
> Key: MESOS-6810
> URL: https://issues.apache.org/jira/browse/MESOS-6810
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.0, 1.0.1, 1.1.0
> Environment: *OS*: ubuntu16.04 64bit
> *mesos*: 1.1.0, one master and one agent on same machine
> *Agent flag*: {{sudo ./bin/mesos-agent.sh --master=192.168.1.192:5050 
> --work_dir=/tmp/mesos_slave --image_providers=docker 
> --isolation=docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia 
> --containerizers=mesos,docker --executor_environment_variables="{}"}}
>Reporter: Yu Yang
>
> When submitting tasks using container settings like:
> {code}
> {
> "container": {
> "mesos": {
>   "image": {
>   "docker": {
>   "name": "nvidia/cuda"
>   },
>   "type": "DOCKER"
>   }
> },
>"type": "MESOS"
> },
> }
> {code}
> then the task gets stuck in the STAGING state and finally fails with the 
> message {{Failed to launch container: Collect failed: Failed to perform 
> 'curl': curl: (56) GnuTLS recv error (-54): Error in pull function}}.
> This is the related log on the agent:
> {code}
> I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 
> 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
> 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 
> 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
> 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown 
> '/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
>  to user 'root'
> I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 
> 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
> 02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
> I1217 13:05:35.409958 20780 slave.cpp:1987] Queued task 
> 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for executor 
> 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
> 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:35.410163 20779 docker.cpp:1000] Skipping non-docker container
> I1217 13:05:35.410636 20776 containerizer.cpp:938] Starting container 
> 8be3b5cd-afa3-4189-aa2a-f09d73529f8c for executor 
> 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
> 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:44.459362 20778 slave.cpp:4992] Terminating executor 
> ''cuda_mesos_nvidia_tf.72e9b9cf-8220-49bd-86fe-1667ee5e7a02' of framework 
> 02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
> 1mins
> I1217 13:05:53.586819 20780 slave.cpp:5044] Current disk usage 63.59%. Max 
> allowed age: 1.848503351525151days
> I1217 13:06:35.410905 20777 slave.cpp:4992] Terminating executor 
> ''mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
> 02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
> 1mins
> I1217 13:06:35.411175 20780 containerizer.cpp:1950] Destroying container 
> 8be3b5cd-afa3-4189-aa2a-f09d73529f8c in PROVISIONING state
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756564#comment-15756564
 ] 

haosdent commented on MESOS-6810:
-

Could you show the result of 
{code}
curl -V
{code}

After some searching, I think your curl is compiled against GnuTLS, which may fail 
when fetching large files: https://curl.haxx.se/docs/ssl-compared.html




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6810:

Description: 
when submit tasks using container settings like:
{code}
{
"container": {
"mesos": {
"image": {
"docker": {
"name": "nvidia/cuda"
},
"type": "DOCKER"
}
},
   "type": "MESOS"
},
}
{code}
then task will get stuck in STAGING state, and finally it will fail with 
message {{Failed to launch container: Collect failed: Failed to perform 'curl': 
curl: (56) GnuTLS recv error (-54): Error in pull function}}
this is the related log on agent

{code}
I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
 to user 'root'
I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; mem(*):32 
in work directory 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
I1217 13:05:35.409958 20780 slave.cpp:1987] Queued task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.410163 20779 docker.cpp:1000] Skipping non-docker container
I1217 13:05:35.410636 20776 containerizer.cpp:938] Starting container 
8be3b5cd-afa3-4189-aa2a-f09d73529f8c for executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:44.459362 20778 slave.cpp:4992] Terminating executor 
''cuda_mesos_nvidia_tf.72e9b9cf-8220-49bd-86fe-1667ee5e7a02' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
1mins
I1217 13:05:53.586819 20780 slave.cpp:5044] Current disk usage 63.59%. Max 
allowed age: 1.848503351525151days
I1217 13:06:35.410905 20777 slave.cpp:4992] Terminating executor 
''mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
1mins
I1217 13:06:35.411175 20780 containerizer.cpp:1950] Destroying container 
8be3b5cd-afa3-4189-aa2a-f09d73529f8c in PROVISIONING state
{code}

  was:
when submit tasks using container settings like:
{code}
{
"container": {
"mesos": {
"image": {
"docker": {
"name": "nvidia/cuda"
},
"type": "DOCKER"
}
},
   "type": "MESOS"
},
}
{code}
then task will get stuck in STAGING state, and finally it will fail with 
message {{Failed to launch container: Collect failed: Failed to perform 'curl': 
curl: (56) GnuTLS recv error (-54): Error in pull function}}
this is the related log on agent

{quote}
I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
 to user 'root'
I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; mem(*):32 
in work directory 

[jira] [Commented] (MESOS-6808) Refactor Docker::run to only take docker cli parameters

2016-12-16 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756528#comment-15756528
 ] 

haosdent commented on MESOS-6808:
-

+1, actually this has affected how we add functions to the docker containerizer.

> Refactor Docker::run to only take docker cli parameters
> ---
>
> Key: MESOS-6808
> URL: https://issues.apache.org/jira/browse/MESOS-6808
> Project: Mesos
>  Issue Type: Task
>  Components: docker
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>
> As we discussed, {{Docker::run}} in src/docker/docker.hpp should only 
> understand docker cli options. The logic of creating these options should be 
> refactored to another helper function.
> This will also allow us to overcome the maximum 10 argument limit of GMOCK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6808) Refactor Docker::run to only take docker cli parameters

2016-12-16 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6808:

Priority: Major  (was: Minor)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread Yu Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756453#comment-15756453
 ] 

Yu Yang commented on MESOS-6810:


Actually I've pulled nvidia/cuda locally and it works well with the Docker 
containerizer. I also tried passing the USTC docker mirror to the agent (using 
--docker_registry), but no luck.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread Yu Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756414#comment-15756414
 ] 

Yu Yang commented on MESOS-6810:


{quote}
*   Trying 52.45.221.131...
* Connected to registry-1.docker.io (52.45.221.131) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 697 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*server certificate verification OK
*server certificate status verification SKIPPED
*common name: *.docker.io (matched)
*server certificate expiration date OK
*server certificate activation date OK
*certificate public key: RSA
*certificate version: #3
*subject: OU=GT98568428,OU=See www.rapidssl.com/resources/cps 
(c)15,OU=Domain Control Validated - RapidSSL(R),CN=*.docker.io
*start date: Thu, 19 Mar 2015 17:34:32 GMT
*expire date: Sat, 21 Apr 2018 01:51:52 GMT
*issuer: C=US,O=GeoTrust Inc.,CN=RapidSSL SHA256 CA - G3
*compression: NULL
* ALPN, server did not agree to a protocol
> GET /v2/nvidia/cuda/manifests/latest HTTP/1.1
> Host: registry-1.docker.io
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 401 Unauthorized
< Content-Type: application/json; charset=utf-8
< Docker-Distribution-Api-Version: registry/2.0
< Www-Authenticate: Bearer 
realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:nvidia/cuda:pull;
< Date: Sat, 17 Dec 2016 05:50:45 GMT
< Content-Length: 143
< Strict-Transport-Security: max-age=31536000
< 
{"errors":[{"code":"UNAUTHORIZED","message":"authentication 
required","detail":[{"Type":"repository","Name":"nvidia/cuda","Action":"pull"}]}]}
* Connection #0 to host registry-1.docker.io left intact

{quote}
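
The 401 above is the expected first step of the Docker Registry v2 token flow rather 
than the failure itself: the client is supposed to fetch a bearer token from the 
advertised {{auth.docker.io}} realm and retry the manifest request with it, which is 
the handshake the Mesos registry puller performs internally via curl. A minimal 
standalone libcurl sketch of that handshake, purely illustrative and not the Mesos 
puller code:
{code}
// Illustrative standalone client (not Mesos code): perform the Docker Registry
// v2 token handshake requested by the 401 above, then retry the manifest fetch.
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t appendToString(char* data, size_t size, size_t nmemb, void* userp) {
  static_cast<std::string*>(userp)->append(data, size * nmemb);
  return size * nmemb;
}

static std::string httpGet(const std::string& url, const std::string& header) {
  std::string body;
  CURL* curl = curl_easy_init();
  struct curl_slist* headers = nullptr;
  if (!header.empty()) {
    headers = curl_slist_append(headers, header.c_str());
  }
  curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
  curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
  curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, appendToString);
  curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
  curl_easy_perform(curl);
  curl_slist_free_all(headers);
  curl_easy_cleanup(curl);
  return body;
}

int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);

  // Step 1: anonymous pull token from the realm/service/scope advertised in the
  // Www-Authenticate header of the 401 response.
  std::string tokenResponse = httpGet(
      "https://auth.docker.io/token?service=registry.docker.io"
      "&scope=repository:nvidia/cuda:pull", "");

  // Naive extraction of the "token" field; a real client would use a JSON parser.
  size_t start = tokenResponse.find("\"token\":\"");
  if (start == std::string::npos) {
    std::cerr << "no token in response: " << tokenResponse << std::endl;
    return 1;
  }
  start += 9;
  std::string token =
      tokenResponse.substr(start, tokenResponse.find('"', start) - start);

  // Step 2: retry the manifest request with the bearer token.
  std::cout << httpGet(
      "https://registry-1.docker.io/v2/nvidia/cuda/manifests/latest",
      "Authorization: Bearer " + token) << std::endl;

  curl_global_cleanup();
  return 0;
}
{code}
If this standalone fetch (built with {{g++ fetch_manifest.cpp -lcurl}}) succeeds on the 
agent host while the puller still fails, the problem is more likely the GnuTLS-backed 
curl discussed earlier in the thread than registry authentication.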


[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756410#comment-15756410
 ] 

Jie Yu commented on MESOS-6810:
---

Can you run {noformat}curl -vvv 
https://registry-1.docker.io/v2/nvidia/cuda/manifests/latest{noformat} and see 
what the output is?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread Yu Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Yang updated MESOS-6810:
---
Description: 
when submit tasks using container settings like:
{code}
{
"container": {
"mesos": {
"image": {
"docker": {
"name": "nvidia/cuda"
},
"type": "DOCKER"
}
},
   "type": "MESOS"
},
}
{code}
then task will get stuck in STAGING state, and finally it will fail with 
message {{Failed to launch container: Collect failed: Failed to perform 'curl': 
curl: (56) GnuTLS recv error (-54): Error in pull function}}
this is the related log on agent

{quote}
I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
 to user 'root'
I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; mem(*):32 
in work directory 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
I1217 13:05:35.409958 20780 slave.cpp:1987] Queued task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.410163 20779 docker.cpp:1000] Skipping non-docker container
I1217 13:05:35.410636 20776 containerizer.cpp:938] Starting container 
8be3b5cd-afa3-4189-aa2a-f09d73529f8c for executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:44.459362 20778 slave.cpp:4992] Terminating executor 
''cuda_mesos_nvidia_tf.72e9b9cf-8220-49bd-86fe-1667ee5e7a02' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
1mins
I1217 13:05:53.586819 20780 slave.cpp:5044] Current disk usage 63.59%. Max 
allowed age: 1.848503351525151days
I1217 13:06:35.410905 20777 slave.cpp:4992] Terminating executor 
''mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
1mins
I1217 13:06:35.411175 20780 containerizer.cpp:1950] Destroying container 
8be3b5cd-afa3-4189-aa2a-f09d73529f8c in PROVISIONING state
{quote}

  was:
when submit tasks using container settings like:
{
"container": {
"mesos": {
"image": {
"docker": {
"name": "nvidia/cuda"
},
"type": "DOCKER"
}
},
"type": "MESOS"
},
}

then task will get stuck in STAGING state, and finally it will fail with 
message {{Failed to launch container: Collect failed: Failed to perform 'curl': 
curl: (56) GnuTLS recv error (-54): Error in pull function}}
this is the related log on agent

{quote}
I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
 to user 'root'
I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; mem(*):32 
in work directory 

[jira] [Created] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer

2016-12-16 Thread Yu Yang (JIRA)
Yu Yang created MESOS-6810:
--

 Summary: Tasks getting stuck in STAGING state when using unified 
containerizer
 Key: MESOS-6810
 URL: https://issues.apache.org/jira/browse/MESOS-6810
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.1.0, 1.0.1, 1.0.0
 Environment: *OS*: ubuntu16.04 64bit
*mesos*: 1.1.0, one master and one agent on same machine
*Agent flag*: {{sudo ./bin/mesos-agent.sh --master=192.168.1.192:5050 
--work_dir=/tmp/mesos_slave --image_providers=docker 
--isolation=docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia 
--containerizers=mesos,docker --executor_environment_variables="{}"}}
Reporter: Yu Yang


When submitting tasks using container settings like:
{
"container": {
"mesos": {
"image": {
"docker": {
"name": "nvidia/cuda"
},
"type": "DOCKER"
}
},
"type": "MESOS"
},
}

then the task gets stuck in the STAGING state and finally fails with the 
message {{Failed to launch container: Collect failed: Failed to perform 'curl': 
curl: (56) GnuTLS recv error (-54): Error in pull function}}.
This is the related log on the agent:

{quote}
I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
 to user 'root'
I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; mem(*):32 
in work directory 
'/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
I1217 13:05:35.409958 20780 slave.cpp:1987] Queued task 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.410163 20779 docker.cpp:1000] Skipping non-docker container
I1217 13:05:35.410636 20776 containerizer.cpp:938] Starting container 
8be3b5cd-afa3-4189-aa2a-f09d73529f8c for executor 
'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:44.459362 20778 slave.cpp:4992] Terminating executor 
''cuda_mesos_nvidia_tf.72e9b9cf-8220-49bd-86fe-1667ee5e7a02' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
1mins
I1217 13:05:53.586819 20780 slave.cpp:5044] Current disk usage 63.59%. Max 
allowed age: 1.848503351525151days
I1217 13:06:35.410905 20777 slave.cpp:4992] Terminating executor 
''mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 
02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 
1mins
I1217 13:06:35.411175 20780 containerizer.cpp:1950] Destroying container 
8be3b5cd-afa3-4189-aa2a-f09d73529f8c in PROVISIONING state
{quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky

2016-12-16 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756044#comment-15756044
 ] 

Greg Mann commented on MESOS-6336:
--

Hi [~a10gupta], I'm very sorry for the long delay in my reply!

I just took a look at your patch, thanks for submitting it! While the patch 
does prevent the segfault, I would prefer to get to the bottom of the real 
issue: why are we attempting to remove a framework which is not in the 
{{frameworks}} map? I added some additional logging output, and it looks like 
we're definitely executing {{Slave::removeFramework}} and erasing the 
FrameworkID from the map before {{Slave::finalize}} gets executed. It's not 
immediately clear to me why {{frameworks.keys()}} would return the FrameworkID, 
but then we would segfault when attempting to access the value of that key.

As [~vinodkone] suggested, it appears to me that the framework has been 
completely removed from the map before {{Slave::finalize}} is called.

I notice that in {{shutdownFramework}}, we log a warning and return early if 
{{framework == nullptr}}, so perhaps we do expect to attempt this operation on 
nonexistent frameworks sometimes. Nonetheless, let's do some investigation and 
nail down the root cause of the segfault. Would you like to get on a hangout 
early next week and troubleshoot?
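
A generic sketch of the hazard under discussion, assuming nothing about the actual 
agent code beyond the pattern described above (a snapshot of {{frameworks.keys()}} 
outliving the map entries themselves); names are illustrative only:
{code}
// Generic illustration (not the agent code) of why a key snapshot plus an
// unchecked lookup can crash even though the key "was there" at snapshot time.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Framework {
  std::string id;
};

int main() {
  std::unordered_map<std::string, Framework*> frameworks;
  frameworks["fw-1"] = new Framework{"fw-1"};

  // Snapshot of keys, as a finalize()-style loop might take.
  std::vector<std::string> keys;
  for (const auto& entry : frameworks) keys.push_back(entry.first);

  // Earlier in shutdown (or concurrently), the framework is removed.
  delete frameworks["fw-1"];
  frameworks.erase("fw-1");

  for (const std::string& key : keys) {
    auto it = frameworks.find(key);
    if (it == frameworks.end()) {
      // Defensive path: log and skip instead of dereferencing a stale entry.
      std::cout << "framework " << key << " already removed; skipping" << std::endl;
      continue;
    }
    std::cout << "shutting down " << it->second->id << std::endl;
  }
  return 0;
}
{code}
The defensive {{find}} check mirrors the early return in {{shutdownFramework}}; the 
open question above is why the snapshot and the map disagree in the first place.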

> SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky
> ---
>
> Key: MESOS-6336
> URL: https://issues.apache.org/jira/browse/MESOS-6336
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Greg Mann
>Assignee: Abhishek Dasgupta
>  Labels: mesosphere
>
> The test {{SlaveTest.KillTaskGroupBetweenRunTaskParts}} sometimes segfaults 
> during the agent's {{finalize()}} method. This was observed on our internal 
> CI, on Fedora with libev, without SSL:
> {code}
> [ RUN  ] SlaveTest.KillTaskGroupBetweenRunTaskParts
> I1007 14:12:57.973811 28630 cluster.cpp:158] Creating default 'local' 
> authorizer
> I1007 14:12:57.982128 28630 leveldb.cpp:174] Opened db in 8.195028ms
> I1007 14:12:57.982599 28630 leveldb.cpp:181] Compacted db in 446238ns
> I1007 14:12:57.982616 28630 leveldb.cpp:196] Created db iterator in 3650ns
> I1007 14:12:57.982622 28630 leveldb.cpp:202] Seeked to beginning of db in 
> 451ns
> I1007 14:12:57.982627 28630 leveldb.cpp:271] Iterated through 0 keys in the 
> db in 352ns
> I1007 14:12:57.982638 28630 replica.cpp:776] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1007 14:12:57.983024 28645 recover.cpp:451] Starting replica recovery
> I1007 14:12:57.983127 28651 recover.cpp:477] Replica is in EMPTY status
> I1007 14:12:57.983459 28644 replica.cpp:673] Replica in EMPTY status received 
> a broadcasted recover request from __req_res__(6234)@172.30.2.161:38776
> I1007 14:12:57.983543 28651 recover.cpp:197] Received a recover response from 
> a replica in EMPTY status
> I1007 14:12:57.983680 28650 recover.cpp:568] Updating replica status to 
> STARTING
> I1007 14:12:57.983990 28648 master.cpp:380] Master 
> 76d4d55f-dcc6-4033-85d9-7ec97ef353cb 
> (ip-172-30-2-161.ec2.internal.mesosphere.io) started on 172.30.2.161:38776
> I1007 14:12:57.984007 28648 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" --credentials="/tmp/rVbcaO/credentials" 
> --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" 
> --registry="replicated_log" --registry_fetch_timeout="1mins" 
> --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" 
> --registry_max_agent_count="102400" --registry_store_timeout="100secs" 
> --registry_strict="false" --root_submissions="true" --user_sorter="drf" 
> --version="false" --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/tmp/rVbcaO/master" --zk_session_timeout="10secs"
> I1007 14:12:57.984127 28648 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1007 14:12:57.984134 28648 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1007 14:12:57.984139 28648 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1007 

[jira] [Commented] (MESOS-6780) ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably

2016-12-16 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755921#comment-15755921
 ] 

Kevin Klues commented on MESOS-6780:


I didn't get a chance to look at this today, and I am on vacation for the 
next few weeks. [~anandmazumdar] or [~vinodkone] could you take a look?

> ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably
> ---
>
> Key: MESOS-6780
> URL: https://issues.apache.org/jira/browse/MESOS-6780
> Project: Mesos
>  Issue Type: Bug
> Environment: Mac OS 10.12, clang version 4.0.0 
> (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) 
> (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), 
> libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46
>Reporter: Benjamin Bannier
> Attachments: attach_container_input_no_ssl.log
>
>
> The test {{ContentType/AgentAPIStreamTest.AttachContainerInput}} (both {{/0}} 
> and {{/1}}) fail consistently for me in an SSL-enabled, optimized build.
> {code}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ContentType/AgentAPIStreamingTest
> [ RUN  ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0
> I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' 
> authorizer
> I1212 17:11:12.393844 17362944 master.cpp:380] Master 
> c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on 
> 172.18.8.114:51059
> I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" 
> --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
> --allocation_interval="1secs" --allocator="HierarchicalDRF" 
> --authenticate_agents="true" --authenticate_frameworks="true" 
> --authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
> --authenticate_http_readwrite="true" --authenticators="crammd5" 
> --authorizers="local" 
> --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials"
>  --framework_sorter="drf" --help="false" --hostname_lookup="true" 
> --http_authenticators="basic" --http_framework_authenticators="basic" 
> --initialize_driver_logging="true" --log_auto_initialize="true" 
> --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
> --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
> --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" 
> --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
> --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
> --registry_store_timeout="100secs" --registry_strict="false" 
> --root_submissions="true" --user_sorter="drf" --version="false" 
> --webui_dir="/usr/local/share/mesos/webui" 
> --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master"
>  --zk_session_timeout="10secs"
> I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing 
> authenticated frameworks to register
> I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing 
> authenticated agents to register
> I1212 17:11:12.394691 17362944 master.cpp:459] Master only allowing 
> authenticated HTTP frameworks to register
> I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for 
> authentication from 
> '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials'
> I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' 
> authenticator
> I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL
> I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readonly'
> I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-readwrite'
> I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP 
> authenticator for realm 'mesos-master-scheduler'
> I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled
> I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master!
> I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar
> I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the 
> registry (0B) in 4.131072ms
> I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in 
> 27us; attempting to update the registry
> I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the 
> registry in 4.10496ms
> I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered 
> registrar
> I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the 
> registry (136B); allowing 10mins for agents to re-register
> I1212 17:11:12.422780 3971208128 containerizer.cpp:220] 

[jira] [Comment Edited] (MESOS-6807) Design mapping between `TaskInfo` and Job Objects

2016-12-16 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755869#comment-15755869
 ] 

Andrew Schwartzmeyer edited comment on MESOS-6807 at 12/17/16 12:14 AM:


This is the investigation work for the eventual implementation in MESOS-6690.


was (Author: andschwa):
This is the investigation work for the eventual implementation in the linked 
issue.

> Design mapping between `TaskInfo` and Job Objects
> -
>
> Key: MESOS-6807
> URL: https://issues.apache.org/jira/browse/MESOS-6807
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
> Environment: Windows Server 2016
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>  Labels: microsoft, windows
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> This issue starts tracking the work of correctly mapping Mesos's `TaskInfo` 
> APIs (as in resource usage limits of particular tasks scheduled on an agent) 
> to [Windows' Job 
> Objects|https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx],
>  which are akin to Linux's `cgroup`s.
> Initial time estimate is for the investigation, not implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6807) Design mapping between `TaskInfo` and Job Objects

2016-12-16 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755869#comment-15755869
 ] 

Andrew Schwartzmeyer commented on MESOS-6807:
-

This is the investigation work for the eventual implementation in the linked 
issue.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6809) libprocess's namespace "unix" conflicts with some preprocessor directives in gcc compiler

2016-12-16 Thread Igor Morozov (JIRA)
Igor Morozov created MESOS-6809:
---

 Summary: libprocess's namespace "unix" conflicts with some 
preprocessor directives in gcc compiler
 Key: MESOS-6809
 URL: https://issues.apache.org/jira/browse/MESOS-6809
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 1.2.0
Reporter: Igor Morozov
Priority: Minor


libprocess uses a namespace named {{unix}}, for example in address.hpp:

#ifndef __WINDOWS__
namespace unix {
class Address;
} // namespace unix {
#endif // __WINDOWS__

GCC defines preprocessor macros with the same name:

gcc -dM -E -std=gnu++11 - < /dev/null | grep unix
#define __unix__ 1
#define __unix 1
#define unix 1

which causes a namespace conflict and a compilation error.

g++ --version
g++ (Debian 4.9.2-10) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.

The fix is to change -std=gnu++11 to -std=c++11 wherever possible.
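
A minimal reproduction of the clash (illustrative, not libprocess code): the file 
below compiles with {{g++ -std=c++11 repro.cpp}} but fails with 
{{g++ -std=gnu++11 repro.cpp}}, because only the GNU dialect predefines the plain 
{{unix}} macro.
{code}
// repro.cpp -- under -std=gnu++11 the predefined macro `unix` expands to 1,
// so `namespace unix` is parsed as `namespace 1` and compilation fails;
// under -std=c++11 only the reserved __unix__/__unix macros are defined.
namespace unix {
class Address {};
}  // namespace unix

int main() {
  unix::Address address;  // with `#define unix 1` this would read `1::Address`
  (void)address;
  return 0;
}
{code}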



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6808) Refactor Docker::run to only take docker cli parameters

2016-12-16 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6808:


 Summary: Refactor Docker::run to only take docker cli parameters
 Key: MESOS-6808
 URL: https://issues.apache.org/jira/browse/MESOS-6808
 Project: Mesos
  Issue Type: Task
  Components: docker
Reporter: Zhitao Li
Assignee: Zhitao Li
Priority: Minor


As we discussed, {{Docker::run}} in src/docker/docker.hpp should only 
understand docker cli options. The logic of creating these options should be 
refactored to another helper function.

This will also allow us to overcome the maximum 10 argument limit of GMOCK.
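
A rough sketch of the proposed split, with illustrative names only (not the actual 
Mesos signatures): option construction moves into a separately testable builder, and 
a {{run}}-like call only receives the finished docker CLI parameters.
{code}
// Sketch of the proposed shape; RunOptions/buildRunOptions are hypothetical names.
#include <iostream>
#include <string>
#include <vector>

struct RunOptions {
  std::vector<std::string> flags;  // e.g. "--cpu-shares=102", "--memory=33554432"
  std::string image;               // e.g. "nvidia/cuda"
  std::vector<std::string> argv;   // command run inside the container
};

// All knowledge of task/container/resource objects lives in a builder like this,
// which callers (and unit tests) exercise separately from run().
RunOptions buildRunOptions(double cpus, long memoryBytes, const std::string& image) {
  RunOptions options;
  options.flags.push_back(
      "--cpu-shares=" + std::to_string(static_cast<long>(cpus * 1024)));
  options.flags.push_back("--memory=" + std::to_string(memoryBytes));
  options.image = image;
  return options;
}

// Docker::run() stand-in: only assembles (and here prints) the final command line.
void run(const RunOptions& options) {
  std::string command = "docker run";
  for (const std::string& flag : options.flags) command += " " + flag;
  command += " " + options.image;
  for (const std::string& arg : options.argv) command += " " + arg;
  std::cout << command << std::endl;
}

int main() {
  RunOptions options = buildRunOptions(0.1, 32 * 1024 * 1024, "nvidia/cuda");
  options.argv = {"nvidia-smi"};
  run(options);
  return 0;
}
{code}
With this shape, a GMock expectation on {{run}} needs a single matcher on the options 
struct instead of ten-plus positional arguments.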



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1280) Add replace task primitive

2016-12-16 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755230#comment-15755230
 ] 

Zhitao Li commented on MESOS-1280:
--

Hi, is there any common interest in pursuing this in the next 1-2 Mesos release 
cycles? Our organization is quite interested in adding this capability for a 
couple of reasons, and would be happy if some committer is willing to shepherd 
us.

Thanks!

> Add replace task primitive
> --
>
> Key: MESOS-1280
> URL: https://issues.apache.org/jira/browse/MESOS-1280
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, c++ api, master
>Reporter: Niklas Quarfot Nielsen
>  Labels: mesosphere
>
> Also along the lines of MESOS-938, replaceTask would be one of a couple of 
> primitives needed to support various task replacement and scaling scenarios. 
> This replaceTask() version is significantly simpler than the first proposed 
> one; its only responsibility is to run a new task info on a running task's 
> resources.
> The running task will be killed as usual, but the newly freed resources will 
> never be announced and the new task will run on them instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6807) Design mapping between `TaskInfo` and Job Objects

2016-12-16 Thread Andrew Schwartzmeyer (JIRA)
Andrew Schwartzmeyer created MESOS-6807:
---

 Summary: Design mapping between `TaskInfo` and Job Objects
 Key: MESOS-6807
 URL: https://issues.apache.org/jira/browse/MESOS-6807
 Project: Mesos
  Issue Type: Bug
  Components: agent
 Environment: Windows Server 2016
Reporter: Andrew Schwartzmeyer
Assignee: Andrew Schwartzmeyer


This issue starts tracking the work of correctly mapping Mesos's `TaskInfo` 
APIs (as in resource usage limits of particular tasks scheduled on an agent) to 
[Windows' Job 
Objects|https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx],
 which are akin to Linux's `cgroup`s.

Initial time estimate is for the investigation, not implementation.
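
As a starting point for that investigation, a rough sketch (assumptions only, not the 
eventual Mesos design) of how a TaskInfo-style limit of roughly {{cpus: 0.1; mem: 32}} 
could be expressed with the Win32 Job Object API:
{code}
// Sketch only: create a job object, apply memory and CPU-rate limits, and
// assign a process to it (the current process stands in for the task here).
#include <windows.h>
#include <iostream>

int main() {
  HANDLE job = CreateJobObjectW(nullptr, nullptr);
  if (job == nullptr) {
    std::cerr << "CreateJobObjectW failed: " << GetLastError() << std::endl;
    return 1;
  }

  // mem(*):32 -> a 32 MB per-process memory cap. A containerizer would also set
  // JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE so the task dies with the job handle.
  JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
  limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
  limits.ProcessMemoryLimit = 32 * 1024 * 1024;
  SetInformationJobObject(
      job, JobObjectExtendedLimitInformation, &limits, sizeof(limits));

  // cpus(*):0.1 -> a hard 10% CPU cap (CpuRate is in 1/100ths of a percent).
  JOBOBJECT_CPU_RATE_CONTROL_INFORMATION cpu = {};
  cpu.ControlFlags =
      JOB_OBJECT_CPU_RATE_CONTROL_ENABLE | JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP;
  cpu.CpuRate = 10 * 100;
  SetInformationJobObject(
      job, JobObjectCpuRateControlInformation, &cpu, sizeof(cpu));

  // The task's process joins the job; here the current process stands in for it.
  AssignProcessToJobObject(job, GetCurrentProcess());

  std::cout << "running under the job object's limits" << std::endl;
  CloseHandle(job);
  return 0;
}
{code}
In a real containerizer the task process would be created suspended, assigned to the 
job, and only then resumed, so nothing runs outside the limits.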



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky

2016-12-16 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755184#comment-15755184
 ] 

Anand Mazumdar commented on MESOS-6784:
---

Committed a fix for the test bug behind the second log snippet that Jie posted.

{noformat}
commit 28eaa8df7c95130b0c244f7613ad506be899cafd
Author: Anand Mazumdar 
Date:   Wed Dec 14 17:40:47 2016 -0800

Fixed the 'IOSwitchboardTest.KillSwitchboardContainerDestroyed' test.

The container was launched with TTY enabled. This meant that
killing the switchboard would trigger the task to terminate
on its own owing to the "master" end of the TTY dying. This
would make it not go through the code path of the isolator
failing due to resource limit issue.

Review: https://reviews.apache.org/r/54770
{noformat}

The original log in the issue description points to a separate issue in the 
switchboard code itself, and I am working on that. This fix should make the CI 
green for now.
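
The TTY behaviour the commit message relies on can be seen in isolation with a small 
sketch (not Mesos code, and deliberately simplified): a child whose controlling 
terminal is the slave end of a pseudo-terminal is sent SIGHUP, and by default dies, 
as soon as the master end is closed.
{code}
// Illustration of the "master end of the TTY dying" effect described above.
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  int master = posix_openpt(O_RDWR | O_NOCTTY);
  grantpt(master);
  unlockpt(master);

  pid_t pid = fork();
  if (pid == 0) {
    // Child: make the pty slave its controlling terminal, like a task
    // launched with TTY enabled.
    setsid();
    int slave = open(ptsname(master), O_RDWR);  // becomes the controlling tty
    dup2(slave, STDIN_FILENO);
    dup2(slave, STDOUT_FILENO);
    pause();  // SIGHUP arrives (and terminates us) when the master side goes away
    _exit(0);
  }

  sleep(1);        // crude synchronization, enough for an illustration
  close(master);   // "kill the switchboard": drop the master end of the TTY

  int status = 0;
  waitpid(pid, &status, 0);
  if (WIFSIGNALED(status)) {
    printf("child terminated by signal %d (SIGHUP expected)\n", WTERMSIG(status));
  }
  return 0;
}
{code}
That is why, with TTY enabled, killing the switchboard (which holds the master end) 
takes the task down on its own before the isolator's resource-limit path is ever 
exercised.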

> IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
> 
>
> Key: MESOS-6784
> URL: https://issues.apache.org/jira/browse/MESOS-6784
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Neil Conway
>Assignee: Anand Mazumdar
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
> [ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
> I1212 13:57:02.641043  2211 containerizer.cpp:220] Using isolation: 
> posix/cpu,filesystem/posix,network/cni
> W1212 13:57:02.641438  2211 backend.cpp:76] Failed to create 'overlay' 
> backend: OverlayBackend requires root privileges, but is running as user nrc
> W1212 13:57:02.641559  2211 backend.cpp:76] Failed to create 'bind' backend: 
> BindBackend requires root privileges
> I1212 13:57:02.642822  2268 containerizer.cpp:594] Recovering containerizer
> I1212 13:57:02.643975  2253 provisioner.cpp:253] Provisioner recovery complete
> I1212 13:57:02.644953  2255 containerizer.cpp:986] Starting container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework
> I1212 13:57:02.647004  2245 switchboard.cpp:430] Allocated pseudo terminal 
> '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.652305  2245 switchboard.cpp:596] Created I/O switchboard 
> server (pid: 2705) listening on socket file 
> '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.655513  2267 launcher.cpp:133] Forked child with pid '2706' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f'
> I1212 13:57:02.655732  2267 containerizer.cpp:1621] Checkpointing container's 
> forked pid 2706 to 
> '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid'
> I1212 13:57:02.726306  2265 containerizer.cpp:2463] Container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f has exited
> I1212 13:57:02.726352  2265 containerizer.cpp:2100] Destroying container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state
> E1212 13:57:02.726495  2243 switchboard.cpp:861] Unexpected termination of 
> I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for 
> container 09e87380-00ab-4987-83c9-fa1c5d86717f
> I1212 13:57:02.726563  2265 launcher.cpp:149] Asked to destroy container 
> 09e87380-00ab-4987-83c9-fa1c5d86717f
> E1212 13:57:02.783607  2228 switchboard.cpp:799] Failed to remove unix domain 
> socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' 
> for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or 
> directory
> ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure
> Value of: wait.get()->reasons().size() == 1
>   Actual: false
> Expected: true
> *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are 
> using GNU date ***
> PC: @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; 
> stack trace: ***
> @ 0x7faecf855100 (unknown)
> @  0x1bf16d0 testing::UnitTest::AddTestPartResult()
> @  0x1be6247 testing::internal::AssertHelper::operator=()
> @  0x19ed751 
> mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody()
> @  0x1c0ed8c 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x1c09e74 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x1beb505 testing::Test::Run()
> @  0x1bebc88 testing::TestInfo::Run()
> @  0x1bec2ce testing::TestCase::Run()
> @  0x1bf2ba8