Hi All,

It's been quite some time since I've posted here, and that's chiefly because, up until a day or two ago, things were working really well.
I actually may have posted about this some time back, but the problem then seemed more intermittent. In short: several "docker stop" commands don't work, i.e., the containers are not stopped.

Deployment (one Ubuntu 14.04 LTS VM under VMware, kernel 3.19):
- ZooKeeper
- Mesos master (0.23.0)
- Mesos slave (0.23.0)
- Marathon (0.10.0)
- Docker 1.9.1
- Weave 1.1.0

Our application's containers include MongoDB (4), Postgres, and ECX (our product).

The only thing that's changed at all in the config above is the version of Docker: it used to be 1.6.2, but I upgraded it today hoping to solve the problem.

My automater program stops the application by sending Marathon an HTTP DELETE for each running app. Every now and then (reliably reproducible today) not all containers get stopped. Most recently, 3 containers failed to stop. Here are the attendant phenomena.

Marathon shows the 3 applications in deployment mode (presumably "deployment" in the sense of "stopping").

*ps output:*

root@71:~# ps -ef | grep docker
root 3823 1 0 13:55 ? 00:00:02 /usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
root 4967 1 0 13:57 ? 00:00:01 /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=15secs --executor_registration_timeout=5mins --hostname=71.100.202.99 --ip=71.100.202.99 --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
root 5263 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5271 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5279 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 5287 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 7119 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 --stop_timeout=15secs
root 7378 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --stop_timeout=15secs
root 7640 4967 0 14:01 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 --stop_timeout=15secs
*root 9696 9695 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
*root 9709 9708 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
*root 9720 9719 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*

*docker ps output:*

root@71:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5abafbfe7de2 mongo:2.6.8 "/w/w /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
a8449682ca2e mongo:2.6.8 "/w/w /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
3b956457374b mongo:2.6.8 "/w/w /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
4c1588bb3d4b weaveworks/weaveexec:v1.1.0 "/home/weave/weavepro" 15 minutes ago Up 15 minutes weaveproxy
a26a0363584b weaveworks/weave:v1.1.0 "/home/weave/weaver -" 15 minutes ago Up 15 minutes 172.17.0.1:53->53/tcp, 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, 0.0.0.0:6783->6783/udp weave

*from /var/log/syslog:*

Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405 5002 master.cpp:2944] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
*Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459 5002 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167*
*Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729 5042 slave.cpp:1755] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000*
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295 5004 http.cpp:283] HTTP GET for /master/state.json from 172.19.15.61:65038 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904 5001 master.cpp:2944] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425935 5001 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136 5041 slave.cpp:1755] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932 4998 master.cpp:2944] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958 4998 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151 5038 slave.cpp:1755] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009 5001 master.cpp:4290] Sending 1 offers to framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at
scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO started processing 1 offers, launching at most 1 tasks per offer and 1000 tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132)
Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO Offer [20160114-135722-1674208327-5050-4917-O128]. Decline with default filter refuseSeconds (use --decline_offer_duration to configure) (mesosphere.marathon.tasks.IterativeOfferMatcher$:231)

*from Mesos STDOUT of an unstopped container:*

Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9
2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : pid=1 port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig
2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8
2016-01-14T19:01:10.998+0000 [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700
2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc
2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: { clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", journal: { enabled: true } } }
2016-01-14T19:01:10.999+0000 [initandlisten] journal dir=/data/db/config/journal
2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal files present, no recovery needed
2016-01-14T19:01:11.429+0000 [initandlisten] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { opid: 11, active: true, secs_running: 0, microsecs_running: 36, op: "query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural: -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId: "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
2016-01-14T19:01:11.429+0000 [initandlisten] waiting for connections on port 27019
2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted from 10.2.0.3:51189 #1 (1 connection now open)
2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51190 #2 (2 connections now open)
2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51191 #3 (3 connections now open)
2016-01-14T19:01:17.414+0000 [conn3] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0
(the "CMD fsync: sync:1 lock:0" line appears seven times in total, 19:01:17.414 through 19:01:17.416)
2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted from 10.2.0.3:51193 #4 (4 connections now open)
2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted from 10.2.0.3:51194 #5 (5 connections now open)
2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189 (4 connections now open)
2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59 virt:385
2016-01-14T19:02:11.285+0000 [clientcursormon] mapped (incl journal view):192
2016-01-14T19:02:11.285+0000 [clientcursormon] connections:4
2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72 virt:385
2016-01-14T19:03:11.294+0000 [clientcursormon] mapped (incl journal view):192
2016-01-14T19:03:11.294+0000 [clientcursormon] connections:4
Killing docker task
Shutting down
Killing docker task
Shutting down
(this "Killing docker task" / "Shutting down" pair repeats about fifteen times, and the output ends mid-pair on a final "Killing docker task")

Most disturbing in all of this: while I can stop the deployments in Marathon (which properly ends the "docker stop" commands visible in the ps output), I cannot bounce Docker itself, neither via Upstart nor via the kill command. Ultimately, I have to reboot the VM.

FWIW, the 3 mongod containers (apparently stuck in their "Killing docker task" / "Shutting down" loop) are running at 100% CPU, as evinced by both "docker stats" and "top".

I would truly be grateful for some guidance on this; even a mere work-around would be appreciated.

Thank you.
-Paul
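P.S. For completeness, the HTTP DELETE my automater sends is just Marathon's standard app-removal endpoint (DELETE /v2/apps/<appId>, which destroys the app and asks Mesos to kill its tasks). A minimal sketch of what it does per app; the Marathon address and app ID below are illustrative, not my exact config:

```shell
# Build the Marathon app-removal URL for a given Marathon host:port and app ID.
marathon_delete_url() {
  local marathon_host="$1" app_id="$2"
  echo "http://${marathon_host}/v2/apps/${app_id}"
}

# What the automater effectively runs for each app (commented out here,
# since it needs a live Marathon):
#   curl -X DELETE "$(marathon_delete_url 71.100.202.99:8080 ecxconfigdb)"

marathon_delete_url 71.100.202.99:8080 ecxconfigdb
# prints: http://71.100.202.99:8080/v2/apps/ecxconfigdb
```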
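P.P.S. Since "docker stop -t 15" sends SIGTERM and escalates to SIGKILL after 15 seconds, a container that survives it (and a daemon that can't be bounced short of a reboot) suggests a process wedged in the kernel. Next time this happens I plan to check the stuck container's process state directly, bypassing the daemon; a sketch, with the docker-specific steps shown as comments and the /proc check demonstrated on the current shell's own PID:

```shell
# For a stuck container, resolve its init PID as seen from the host:
#   cid=3b956457374b                                  # one of the stuck mongos above
#   pid=$(docker inspect --format '{{.State.Pid}}' "$cid")
#
# A state of "D" (uninterruptible sleep) in /proc/<pid>/status would explain
# why even SIGKILL has no effect. Demonstrated here on this shell's PID:
pid=$$
grep '^State:' "/proc/$pid/status"

# Last resort, bypassing the Docker daemon entirely:
#   kill -KILL "$pid"
```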