[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
[ https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756585#comment-15756585 ] Yu Yang commented on MESOS-6810:

This is the output of {{curl -V}}:
{quote}
curl 7.47.0 (x86_64-pc-linux-gnu) libcurl/7.47.0 GnuTLS/3.4.10 zlib/1.2.8 libidn/1.32 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp
Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP UnixSockets
{quote}
I should clarify that this problem also reproduces in a Mesos cluster of nearly 30 machines; there the error message is:
{{Failed to launch container: Collect failed: Failed to perform 'curl': curl: (56) SSL read: error::lib(0):func(0):reason(0), errno 104; Container destroyed while provisioning images}}

> Tasks getting stuck in STAGING state when using unified containerizer
> -
>
> Key: MESOS-6810
> URL: https://issues.apache.org/jira/browse/MESOS-6810
> Project: Mesos
> Issue Type: Bug
> Components: containerization, docker
> Affects Versions: 1.0.0, 1.0.1, 1.1.0
> Environment: *OS*: Ubuntu 16.04 64-bit
> *mesos*: 1.1.0, one master and one agent on the same machine
> *Agent flags*: {{sudo ./bin/mesos-agent.sh --master=192.168.1.192:5050 --work_dir=/tmp/mesos_slave --image_providers=docker --isolation=docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia --containerizers=mesos,docker --executor_environment_variables="{}"}}
> Reporter: Yu Yang
>
> When submitting a task with container settings like:
> {code}
> {
>   "container": {
>     "mesos": {
>       "image": {
>         "docker": {
>           "name": "nvidia/cuda"
>         },
>         "type": "DOCKER"
>       }
>     },
>     "type": "MESOS"
>   }
> }
> {code}
> the task gets stuck in the STAGING state and finally fails with the message {{Failed to launch container: Collect failed: Failed to perform 'curl': curl: (56) GnuTLS recv error (-54): Error in pull function}}
> This is the related log on the agent:
> {code}
> I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown '/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c' to user 'root'
> I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
> I1217 13:05:35.409958 20780 slave.cpp:1987] Queued task 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for executor 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:35.410163 20779 docker.cpp:1000] Skipping non-docker container
> I1217 13:05:35.410636 20776 containerizer.cpp:938] Starting container 8be3b5cd-afa3-4189-aa2a-f09d73529f8c for executor 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001
> I1217 13:05:44.459362 20778 slave.cpp:4992] Terminating executor ''cuda_mesos_nvidia_tf.72e9b9cf-8220-49bd-86fe-1667ee5e7a02' of framework 02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 1mins
> I1217 13:05:53.586819 20780 slave.cpp:5044] Current disk usage 63.59%. Max allowed age: 1.848503351525151days
> I1217 13:06:35.410905 20777 slave.cpp:4992] Terminating executor ''mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 1mins
> I1217 13:06:35.411175 20780 containerizer.cpp:1950] Destroying container 8be3b5cd-afa3-4189-aa2a-f09d73529f8c in PROVISIONING state
> {code}
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
[ https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756564#comment-15756564 ] haosdent commented on MESOS-6810:

Could you show the result of
{code}
curl -V
{code}
After some searching, I think your curl is compiled against GnuTLS, which may fail when fetching large files: https://curl.haxx.se/docs/ssl-compared.html
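The suggestion above can be checked quickly: the first line of {{curl -V}} names the TLS library the build links against (the reporter's output shows {{GnuTLS/3.4.10}}). A small illustrative Python sketch of that check; the helper name and the backend list are illustrative, not part of Mesos or curl:

```python
import re

# TLS backends that may appear in the first line of `curl -V` output,
# e.g. "curl 7.47.0 (x86_64-pc-linux-gnu) libcurl/7.47.0 GnuTLS/3.4.10 ..."
TLS_BACKENDS = ("GnuTLS", "OpenSSL", "NSS", "BoringSSL", "mbedTLS")

def tls_backend(version_line):
    """Return the TLS backend named in a `curl -V` version line, or None."""
    for backend in TLS_BACKENDS:
        # Backends are reported as "Name/version", e.g. "GnuTLS/3.4.10".
        if re.search(r"\b" + re.escape(backend) + r"/", version_line):
            return backend
    return None

# The reporter's build from this thread:
line = ("curl 7.47.0 (x86_64-pc-linux-gnu) libcurl/7.47.0 GnuTLS/3.4.10 "
        "zlib/1.2.8 libidn/1.32 librtmp/2.3")
print(tls_backend(line))  # -> GnuTLS
```

On a live machine one would feed it `subprocess.check_output(["curl", "-V"]).decode().splitlines()[0]` instead of the hard-coded line.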
[jira] [Updated] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
[ https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-6810:

Description: edited so that the agent log is wrapped in a {code} block instead of a {quote} block; the description text is otherwise unchanged from the first message in this thread.
[jira] [Commented] (MESOS-6808) Refactor Docker::run to only take docker cli parameters
[ https://issues.apache.org/jira/browse/MESOS-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756528#comment-15756528 ] haosdent commented on MESOS-6808:

+1, actually this has affected how we add functions to the docker containerizer.

> Refactor Docker::run to only take docker cli parameters
> ---
>
> Key: MESOS-6808
> URL: https://issues.apache.org/jira/browse/MESOS-6808
> Project: Mesos
> Issue Type: Task
> Components: docker
> Reporter: Zhitao Li
> Assignee: Zhitao Li
>
> As we discussed, {{Docker::run}} in src/docker/docker.hpp should only understand docker cli options. The logic of creating these options should be refactored into another helper function.
> This will also allow us to overcome the maximum 10-argument limit of GMOCK.
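The proposed split can be sketched roughly as follows. This is an illustrative Python analogue of the C++ refactoring (all names here are hypothetical, not the actual Mesos code): one helper builds the docker CLI argument list from high-level options, and a run function only consumes the finished list:

```python
from dataclasses import dataclass

@dataclass
class RunOptions:
    """Hypothetical high-level options for a `docker run` invocation."""
    image: str
    cpus: float = None
    memory_mb: int = None
    env: dict = None

def build_run_args(opts):
    """Translate high-level options into `docker run` CLI arguments."""
    args = ["run"]
    if opts.cpus is not None:
        # Docker's --cpu-shares default is 1024 per CPU.
        args += ["--cpu-shares", str(int(opts.cpus * 1024))]
    if opts.memory_mb is not None:
        args += ["--memory", "%dm" % opts.memory_mb]
    for key, value in (opts.env or {}).items():
        args += ["-e", "%s=%s" % (key, value)]
    args.append(opts.image)
    return args

def run(args):
    """Only understands finished CLI arguments; here it just builds the
    command line that would be executed."""
    return ["docker"] + args

print(run(build_run_args(RunOptions(image="nvidia/cuda", memory_mb=32))))
```

Separating option construction from execution also makes the option builder testable on its own, which is the point about the GMOCK argument limit.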
[jira] [Updated] (MESOS-6808) Refactor Docker::run to only take docker cli parameters
[ https://issues.apache.org/jira/browse/MESOS-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-6808:

Priority: Major (was: Minor)
[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
[ https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756453#comment-15756453 ] Yu Yang commented on MESOS-6810:

Actually, I've pulled nvidia/cuda locally and it works well with the docker containerizer. I also tried passing the USTC docker mirror to the agent (using --docker_registry), but no luck.
[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
[ https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756414#comment-15756414 ] Yu Yang commented on MESOS-6810:

{quote}
* Trying 52.45.221.131...
* Connected to registry-1.docker.io (52.45.221.131) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 697 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification OK
* server certificate status verification SKIPPED
* common name: *.docker.io (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: OU=GT98568428,OU=See www.rapidssl.com/resources/cps (c)15,OU=Domain Control Validated - RapidSSL(R),CN=*.docker.io
* start date: Thu, 19 Mar 2015 17:34:32 GMT
* expire date: Sat, 21 Apr 2018 01:51:52 GMT
* issuer: C=US,O=GeoTrust Inc.,CN=RapidSSL SHA256 CA - G3
* compression: NULL
* ALPN, server did not agree to a protocol
> GET /v2/nvidia/cuda/manifests/latest HTTP/1.1
> Host: registry-1.docker.io
> User-Agent: curl/7.47.0
> Accept: */*
< HTTP/1.1 401 Unauthorized
< Content-Type: application/json; charset=utf-8
< Docker-Distribution-Api-Version: registry/2.0
< Www-Authenticate: Bearer realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:nvidia/cuda:pull"
< Date: Sat, 17 Dec 2016 05:50:45 GMT
< Content-Length: 143
< Strict-Transport-Security: max-age=31536000
<
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":[{"Type":"repository","Name":"nvidia/cuda","Action":"pull"}]}]}
* Connection #0 to host registry-1.docker.io left intact
{quote}
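The 401 in that trace is the normal first step of the Docker Registry v2 token handshake, not the failure itself: the client is expected to parse the {{Www-Authenticate}} header, fetch a token from the advertised realm, and retry the request with {{Authorization: Bearer <token>}}. A minimal, illustrative Python sketch of the header parsing (the helper names are my own, not Mesos code; no network access is performed):

```python
import re

def parse_bearer_challenge(header):
    """Parse a `Www-Authenticate: Bearer k="v",...` challenge into a dict."""
    scheme, _, params = header.partition(" ")
    if scheme != "Bearer":
        raise ValueError("not a Bearer challenge: %s" % scheme)
    # Parameters are comma-separated key="value" pairs.
    return dict(re.findall(r'(\w+)="([^"]*)"', params))

def token_url(challenge):
    """Build the URL for requesting a pull token from the advertised realm."""
    return "%s?service=%s&scope=%s" % (
        challenge["realm"], challenge["service"], challenge["scope"])

# The exact challenge returned by registry-1.docker.io above:
header = ('Bearer realm="https://auth.docker.io/token",'
          'service="registry.docker.io",scope="repository:nvidia/cuda:pull"')
print(token_url(parse_bearer_challenge(header)))
```

A client would GET that URL, extract the token from the JSON response, and repeat the manifest request with the `Authorization: Bearer` header set; the Mesos fetcher performs this handshake internally, which is why the anonymous 401 alone does not explain the reported TLS error.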
[jira] [Commented] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
[ https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756410#comment-15756410 ] Jie Yu commented on MESOS-6810:
---
Can you run
{noformat}
curl -vvv https://registry-1.docker.io/v2/nvidia/cuda/manifests/latest
{noformat}
and see what the output is?
[jira] [Updated] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
[ https://issues.apache.org/jira/browse/MESOS-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Yang updated MESOS-6810:
---
Description: edited so that the container-settings JSON is wrapped in a {code} block; the description text is otherwise unchanged from the first message in this thread.
[jira] [Created] (MESOS-6810) Tasks getting stuck in STAGING state when using unified containerizer
Yu Yang created MESOS-6810:
-------------------------------

Summary: Tasks getting stuck in STAGING state when using unified containerizer
Key: MESOS-6810
URL: https://issues.apache.org/jira/browse/MESOS-6810
Project: Mesos
Issue Type: Bug
Components: containerization, docker
Affects Versions: 1.1.0, 1.0.1, 1.0.0
Environment: *OS*: ubuntu16.04 64bit
*mesos*: 1.1.0, one master and one agent on same machine
*Agent flag*: {{sudo ./bin/mesos-agent.sh --master=192.168.1.192:5050 --work_dir=/tmp/mesos_slave --image_providers=docker --isolation=docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia --containerizers=mesos,docker --executor_environment_variables="{}"}}
Reporter: Yu Yang

When submitting tasks using container settings like:
{code}
{
  "container": {
    "mesos": {
      "image": {
        "docker": {
          "name": "nvidia/cuda"
        },
        "type": "DOCKER"
      }
    },
    "type": "MESOS"
  }
}
{code}
the task gets stuck in the STAGING state and finally fails with the message {{Failed to launch container: Collect failed: Failed to perform 'curl': curl: (56) GnuTLS recv error (-54): Error in pull function}}.

This is the related log on the agent:
{code}
I1217 13:05:35.406365 20780 slave.cpp:1539] Got assigned task 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406749 20780 slave.cpp:1701] Launching task 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for framework 02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.406970 20780 paths.cpp:536] Trying to chown '/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c' to user 'root'
I1217 13:05:35.409272 20780 slave.cpp:6179] Launching executor 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos_slave/slaves/02083c57-b2d9-4054-babe-90e962816813-S0/frameworks/02083c57-b2d9-4054-babe-90e962816813-0001/executors/mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591/runs/8be3b5cd-afa3-4189-aa2a-f09d73529f8c'
I1217 13:05:35.409958 20780 slave.cpp:1987] Queued task 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' for executor 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:35.410163 20779 docker.cpp:1000] Skipping non-docker container
I1217 13:05:35.410636 20776 containerizer.cpp:938] Starting container 8be3b5cd-afa3-4189-aa2a-f09d73529f8c for executor 'mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001
I1217 13:05:44.459362 20778 slave.cpp:4992] Terminating executor ''cuda_mesos_nvidia_tf.72e9b9cf-8220-49bd-86fe-1667ee5e7a02' of framework 02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 1mins
I1217 13:05:53.586819 20780 slave.cpp:5044] Current disk usage 63.59%. Max allowed age: 1.848503351525151days
I1217 13:06:35.410905 20777 slave.cpp:4992] Terminating executor ''mesos_containerizer_test.2a845a72-7b54-4a95-b6fa-6aeda8c6b591' of framework 02083c57-b2d9-4054-babe-90e962816813-0001' because it did not register within 1mins
I1217 13:06:35.411175 20780 containerizer.cpp:1950] Destroying container 8be3b5cd-afa3-4189-aa2a-f09d73529f8c in PROVISIONING state
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6336) SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky
[ https://issues.apache.org/jira/browse/MESOS-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756044#comment-15756044 ] Greg Mann commented on MESOS-6336: -- Hi [~a10gupta], I'm very sorry for the long delay in my reply! I just took a look at your patch, thanks for submitting it! While the patch does prevent the segfault, I would prefer to get to the bottom of the real issue: why are we attempting to remove a framework which is not in the {{frameworks}} map? I added some additional logging output, and it looks like we're definitely executing {{Slave::removeFramework}} and erasing the FrameworkID from the map before {{Slave::finalize}} gets executed. It's not immediately clear to me why {{frameworks.keys()}} would return the FrameworkID, but then we would segfault when attempting to access the value of that key. As [~vinodkone] suggested, it appears to me that the framework has been completely removed from the map before {{Slave::finalize}} is called. I notice that in {{shutdownFramework}}, we log a warning and return early if {{framework == nullptr}}, so perhaps we do expect to attempt this operation on nonexistent frameworks sometimes. Nonetheless, let's do some investigation and nail down the root cause of the segfault. Would you like to get on a hangout early next week and troubleshoot? > SlaveTest.KillTaskGroupBetweenRunTaskParts is flaky > --- > > Key: MESOS-6336 > URL: https://issues.apache.org/jira/browse/MESOS-6336 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Greg Mann >Assignee: Abhishek Dasgupta > Labels: mesosphere > > The test {{SlaveTest.KillTaskGroupBetweenRunTaskParts}} sometimes segfaults > during the agent's {{finalize()}} method. 
This was observed on our internal > CI, on Fedora with libev, without SSL: > {code} > [ RUN ] SlaveTest.KillTaskGroupBetweenRunTaskParts > I1007 14:12:57.973811 28630 cluster.cpp:158] Creating default 'local' > authorizer > I1007 14:12:57.982128 28630 leveldb.cpp:174] Opened db in 8.195028ms > I1007 14:12:57.982599 28630 leveldb.cpp:181] Compacted db in 446238ns > I1007 14:12:57.982616 28630 leveldb.cpp:196] Created db iterator in 3650ns > I1007 14:12:57.982622 28630 leveldb.cpp:202] Seeked to beginning of db in > 451ns > I1007 14:12:57.982627 28630 leveldb.cpp:271] Iterated through 0 keys in the > db in 352ns > I1007 14:12:57.982638 28630 replica.cpp:776] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I1007 14:12:57.983024 28645 recover.cpp:451] Starting replica recovery > I1007 14:12:57.983127 28651 recover.cpp:477] Replica is in EMPTY status > I1007 14:12:57.983459 28644 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from __req_res__(6234)@172.30.2.161:38776 > I1007 14:12:57.983543 28651 recover.cpp:197] Received a recover response from > a replica in EMPTY status > I1007 14:12:57.983680 28650 recover.cpp:568] Updating replica status to > STARTING > I1007 14:12:57.983990 28648 master.cpp:380] Master > 76d4d55f-dcc6-4033-85d9-7ec97ef353cb > (ip-172-30-2-161.ec2.internal.mesosphere.io) started on 172.30.2.161:38776 > I1007 14:12:57.984007 28648 master.cpp:382] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/rVbcaO/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" 
--http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" > --registry_max_agent_count="102400" --registry_store_timeout="100secs" > --registry_strict="false" --root_submissions="true" --user_sorter="drf" > --version="false" --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/tmp/rVbcaO/master" --zk_session_timeout="10secs" > I1007 14:12:57.984127 28648 master.cpp:432] Master only allowing > authenticated frameworks to register > I1007 14:12:57.984134 28648 master.cpp:446] Master only allowing > authenticated agents to register > I1007 14:12:57.984139 28648 master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > I1007
[jira] [Commented] (MESOS-6780) ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably
[ https://issues.apache.org/jira/browse/MESOS-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755921#comment-15755921 ] Kevin Klues commented on MESOS-6780: I didn't get a chance to look at this today and am out on vacation for the next few weeks. [~anandmazumdar] or [~vinodkone], could you take a look? > ContentType/AgentAPIStreamTest.AttachContainerInput test fails reliably > --- > > Key: MESOS-6780 > URL: https://issues.apache.org/jira/browse/MESOS-6780 > Project: Mesos > Issue Type: Bug > Environment: Mac OS 10.12, clang version 4.0.0 > (http://llvm.org/git/clang 88800602c0baafb8739cb838c2fa3f5fb6cc6968) > (http://llvm.org/git/llvm 25801f0f22e178343ee1eadfb4c6cc058628280e), > libc++-513447dbb91dd555ea08297dbee6a1ceb6abdc46 >Reporter: Benjamin Bannier > Attachments: attach_container_input_no_ssl.log > > > The test {{ContentType/AgentAPIStreamTest.AttachContainerInput}} (both {{/0}} > and {{/1}}) fail consistently for me in an SSL-enabled, optimized build. > {code} > [==] Running 1 test from 1 test case. > [--] Global test environment set-up.
> [--] 1 test from ContentType/AgentAPIStreamingTest > [ RUN ] ContentType/AgentAPIStreamingTest.AttachContainerInput/0 > I1212 17:11:12.371175 3971208128 cluster.cpp:160] Creating default 'local' > authorizer > I1212 17:11:12.393844 17362944 master.cpp:380] Master > c752777c-d947-4a86-b382-643463866472 (172.18.8.114) started on > 172.18.8.114:51059 > I1212 17:11:12.393899 17362944 master.cpp:382] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" > --credentials="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" > --work_dir="/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/master" > --zk_session_timeout="10secs" > I1212 17:11:12.394670 17362944 master.cpp:432] Master only allowing > authenticated frameworks to register > I1212 17:11:12.394682 17362944 master.cpp:446] Master only allowing > authenticated agents to register > I1212 17:11:12.394691 17362944 
master.cpp:459] Master only allowing > authenticated HTTP frameworks to register > I1212 17:11:12.394701 17362944 credentials.hpp:37] Loading credentials for > authentication from > '/private/var/folders/6t/yp_xgc8d6k32rpp0bsbfqm9mgp/T/F46yYV/credentials' > I1212 17:11:12.394959 17362944 master.cpp:504] Using default 'crammd5' > authenticator > I1212 17:11:12.394996 17362944 authenticator.cpp:519] Initializing server SASL > I1212 17:11:12.411406 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I1212 17:11:12.411571 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I1212 17:11:12.411682 17362944 http.cpp:922] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I1212 17:11:12.411775 17362944 master.cpp:584] Authorization enabled > I1212 17:11:12.413318 16289792 master.cpp:2045] Elected as the leading master! > I1212 17:11:12.413377 16289792 master.cpp:1568] Recovering from registrar > I1212 17:11:12.417582 14143488 registrar.cpp:362] Successfully fetched the > registry (0B) in 4.131072ms > I1212 17:11:12.417667 14143488 registrar.cpp:461] Applied 1 operations in > 27us; attempting to update the registry > I1212 17:11:12.421799 14143488 registrar.cpp:506] Successfully updated the > registry in 4.10496ms > I1212 17:11:12.421835 14143488 registrar.cpp:392] Successfully recovered > registrar > I1212 17:11:12.421998 17362944 master.cpp:1684] Recovered 0 agents from the > registry (136B); allowing 10mins for agents to re-register > I1212 17:11:12.422780 3971208128 containerizer.cpp:220]
[jira] [Comment Edited] (MESOS-6807) Design mapping between `TaskInfo` and Job Objects
[ https://issues.apache.org/jira/browse/MESOS-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755869#comment-15755869 ] Andrew Schwartzmeyer edited comment on MESOS-6807 at 12/17/16 12:14 AM: This is the investigation work for the eventual implementation in MESOS-6690. was (Author: andschwa): This is the investigation work for the eventual implementation in the linked issue. > Design mapping between `TaskInfo` and Job Objects > - > > Key: MESOS-6807 > URL: https://issues.apache.org/jira/browse/MESOS-6807 > Project: Mesos > Issue Type: Bug > Components: agent > Environment: Windows Server 2016 >Reporter: Andrew Schwartzmeyer >Assignee: Andrew Schwartzmeyer > Labels: microsoft, windows > Original Estimate: 168h > Remaining Estimate: 168h > > This issue starts tracking the work of correctly mapping Mesos's `TaskInfo` > APIs (as in resource usage limits of particular tasks scheduled on an agent) > to [Windows' Job > Objects|https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx], > which are akin to Linux's `cgroup`s. > Initial time estimate is for the investigation, not implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6807) Design mapping between `TaskInfo` and Job Objects
[ https://issues.apache.org/jira/browse/MESOS-6807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755869#comment-15755869 ] Andrew Schwartzmeyer commented on MESOS-6807: - This is the investigation work for the eventual implementation in the linked issue. > Design mapping between `TaskInfo` and Job Objects > - > > Key: MESOS-6807 > URL: https://issues.apache.org/jira/browse/MESOS-6807 > Project: Mesos > Issue Type: Bug > Components: agent > Environment: Windows Server 2016 >Reporter: Andrew Schwartzmeyer >Assignee: Andrew Schwartzmeyer > Labels: microsoft, windows > Original Estimate: 168h > Remaining Estimate: 168h > > This issue starts tracking the work of correctly mapping Mesos's `TaskInfo` > APIs (as in resource usage limits of particular tasks scheduled on an agent) > to [Windows' Job > Objects|https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx], > which are akin to Linux's `cgroup`s. > Initial time estimate is for the investigation, not implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6809) libprocess's namespace "unix" conflicts with some preprocessor directives in gcc compiler
Igor Morozov created MESOS-6809:
-----------------------------------

Summary: libprocess's namespace "unix" conflicts with some preprocessor directives in gcc compiler
Key: MESOS-6809
URL: https://issues.apache.org/jira/browse/MESOS-6809
Project: Mesos
Issue Type: Bug
Components: libprocess
Affects Versions: 1.2.0
Reporter: Igor Morozov
Priority: Minor

libprocess uses the namespace {{unix}}, for example in address.hpp:
{code}
#ifndef __WINDOWS__
namespace unix {
class Address;
} // namespace unix
#endif // __WINDOWS__
{code}
GCC predefines macros with the same name:
{code}
$ gcc -dM -E -std=gnu++11 - < /dev/null | grep unix
#define __unix__ 1
#define __unix 1
#define unix 1
{code}
which causes a namespace conflict and a compilation error.
{code}
$ g++ --version
g++ (Debian 4.9.2-10) 4.9.2
Copyright (C) 2014 Free Software Foundation, Inc.
{code}
The fix is to change -std=gnu++11 to -std=c++11 whenever possible.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-6808) Refactor Docker::run to only take docker cli parameters
Zhitao Li created MESOS-6808:
--------------------------------

Summary: Refactor Docker::run to only take docker cli parameters
Key: MESOS-6808
URL: https://issues.apache.org/jira/browse/MESOS-6808
Project: Mesos
Issue Type: Task
Components: docker
Reporter: Zhitao Li
Assignee: Zhitao Li
Priority: Minor

As we discussed, {{Docker::run}} in src/docker/docker.hpp should only understand docker CLI options. The logic of creating these options should be refactored into a separate helper function. This will also allow us to overcome GMock's maximum of 10 arguments for mocked methods.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-1280) Add replace task primitive
[ https://issues.apache.org/jira/browse/MESOS-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755230#comment-15755230 ] Zhitao Li commented on MESOS-1280: -- Hi, is there any common interest in pursuing this in the next 1-2 Mesos release cycles? Our organization is quite interested in adding this capability for a couple of reasons, and we would be happy if a committer is willing to shepherd us. Thanks! > Add replace task primitive > -- > > Key: MESOS-1280 > URL: https://issues.apache.org/jira/browse/MESOS-1280 > Project: Mesos > Issue Type: Bug > Components: agent, c++ api, master >Reporter: Niklas Quarfot Nielsen > Labels: mesosphere > > Also along the lines of MESOS-938, replaceTask would be one of a couple of > primitives needed to support various task replacement and scaling scenarios. > This replaceTask() version is significantly simpler than the first proposed > one; its only responsibility is to run a new task info on a running task's > resources. > The running task will be killed as usual, but the newly freed resources will > never be announced and the new task will run on them instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6807) Design mapping between `TaskInfo` and Job Objects
Andrew Schwartzmeyer created MESOS-6807:
-------------------------------------------

Summary: Design mapping between `TaskInfo` and Job Objects
Key: MESOS-6807
URL: https://issues.apache.org/jira/browse/MESOS-6807
Project: Mesos
Issue Type: Bug
Components: agent
Environment: Windows Server 2016
Reporter: Andrew Schwartzmeyer
Assignee: Andrew Schwartzmeyer

This issue starts tracking the work of correctly mapping Mesos's `TaskInfo` APIs (as in resource usage limits of particular tasks scheduled on an agent) to [Windows' Job Objects|https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx], which are akin to Linux's `cgroup`s.

Initial time estimate is for the investigation, not implementation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (MESOS-6784) IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky
[ https://issues.apache.org/jira/browse/MESOS-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755184#comment-15755184 ] Anand Mazumdar commented on MESOS-6784: --- Committed a fix for the test bug shown in the second log snippet that Jie posted. {noformat} commit 28eaa8df7c95130b0c244f7613ad506be899cafd Author: Anand Mazumdar
Date: Wed Dec 14 17:40:47 2016 -0800 Fixed the 'IOSwitchboardTest.KillSwitchboardContainerDestroyed' test. The container was launched with TTY enabled. This meant that killing the switchboard would trigger the task to terminate on its own owing to the "master" end of the TTY dying. This would make it not go through the code path of the isolator failing due to resource limit issue. Review: https://reviews.apache.org/r/54770 {noformat} The original log in the issue description is a separate issue in the switchboard code itself, and I am working on that. This should make the CI green for now. > IOSwitchboardTest.KillSwitchboardContainerDestroyed is flaky > > > Key: MESOS-6784 > URL: https://issues.apache.org/jira/browse/MESOS-6784 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Neil Conway >Assignee: Anand Mazumdar >Priority: Blocker > Labels: mesosphere > > {noformat} > [ RUN ] IOSwitchboardTest.KillSwitchboardContainerDestroyed > I1212 13:57:02.641043 2211 containerizer.cpp:220] Using isolation: > posix/cpu,filesystem/posix,network/cni > W1212 13:57:02.641438 2211 backend.cpp:76] Failed to create 'overlay' > backend: OverlayBackend requires root privileges, but is running as user nrc > W1212 13:57:02.641559 2211 backend.cpp:76] Failed to create 'bind' backend: > BindBackend requires root privileges > I1212 13:57:02.642822 2268 containerizer.cpp:594] Recovering containerizer > I1212 13:57:02.643975 2253 provisioner.cpp:253] Provisioner recovery complete > I1212 13:57:02.644953 2255 containerizer.cpp:986] Starting container > 09e87380-00ab-4987-83c9-fa1c5d86717f for executor 'executor' of framework > 
I1212 13:57:02.647004 2245 switchboard.cpp:430] Allocated pseudo terminal > '/dev/pts/54' for container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.652305 2245 switchboard.cpp:596] Created I/O switchboard > server (pid: 2705) listening on socket file > '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' for > container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.655513 2267 launcher.cpp:133] Forked child with pid '2706' > for container '09e87380-00ab-4987-83c9-fa1c5d86717f' > I1212 13:57:02.655732 2267 containerizer.cpp:1621] Checkpointing container's > forked pid 2706 to > '/tmp/IOSwitchboardTest_KillSwitchboardContainerDestroyed_Me5CRx/meta/slaves/frameworks/executors/executor/runs/09e87380-00ab-4987-83c9-fa1c5d86717f/pids/forked.pid' > I1212 13:57:02.726306 2265 containerizer.cpp:2463] Container > 09e87380-00ab-4987-83c9-fa1c5d86717f has exited > I1212 13:57:02.726352 2265 containerizer.cpp:2100] Destroying container > 09e87380-00ab-4987-83c9-fa1c5d86717f in RUNNING state > E1212 13:57:02.726495 2243 switchboard.cpp:861] Unexpected termination of > I/O switchboard server: 'IOSwitchboard' exited with signal: Killed for > container 09e87380-00ab-4987-83c9-fa1c5d86717f > I1212 13:57:02.726563 2265 launcher.cpp:149] Asked to destroy container > 09e87380-00ab-4987-83c9-fa1c5d86717f > E1212 13:57:02.783607 2228 switchboard.cpp:799] Failed to remove unix domain > socket file '/tmp/mesos-io-switchboard-b4af1c92-6633-44f3-9d35-e0e36edaf70a' > for container '09e87380-00ab-4987-83c9-fa1c5d86717f': No such file or > directory > ../../mesos/src/tests/containerizer/io_switchboard_tests.cpp:661: Failure > Value of: wait.get()->reasons().size() == 1 > Actual: false > Expected: true > *** Aborted at 1481579822 (unix time) try "date -d @1481579822" if you are > using GNU date *** > PC: @ 0x1bf16d0 testing::UnitTest::AddTestPartResult() > *** SIGSEGV (@0x0) received by PID 2211 (TID 0x7faed7d078c0) from PID 0; > stack trace: *** > @ 0x7faecf855100 
(unknown) > @ 0x1bf16d0 testing::UnitTest::AddTestPartResult() > @ 0x1be6247 testing::internal::AssertHelper::operator=() > @ 0x19ed751 > mesos::internal::tests::IOSwitchboardTest_KillSwitchboardContainerDestroyed_Test::TestBody() > @ 0x1c0ed8c > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0x1c09e74 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0x1beb505 testing::Test::Run() > @ 0x1bebc88 testing::TestInfo::Run() > @ 0x1bec2ce testing::TestCase::Run() > @ 0x1bf2ba8