Tim, I've tracked down the cause of this problem: it's the result of some kind of incompatibility between kernel 3.19 and "VMware Tools". I know little more than that.
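(For reference, a quick way to see exactly which kernel and which VMware Tools bits are in play on the VM - just a rough sanity check, nothing VMware-specific beyond grepping for the vmw_* guest modules:

uname -r
dpkg -l | grep -E 'open-vm-tools|linux-image'   # installed tools package and kernel images
lsmod | grep -i vmw                             # rough check that the VMware guest modules are loaded
)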
I installed VMware Tools via *apt-get install open-vm-tools-lts-trusty*. Everything worked fine on 3.13. But when I upgrade to 3.19, the error occurs quite reliably; revert to 3.13 and the error goes away. I looked high & low for some statement of kernel requirements for VMware Tools, but could find none. Sorry to have wasted your time. -Paul On Fri, Jan 15, 2016 at 9:19 AM, Paul Bell <arach...@gmail.com> wrote: > In chasing down this problem, I stumbled upon something of moment: the > problem does NOT seem to happen with kernel 3.13. > > Some weeks back, in the hope of getting past another problem wherein the > root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04 > LTS). The kernel upgrade was done as shown here (there's some extra stuff > to get rid of Ubuntu desktop and liberate some disk space): > > apt-get update > apt-get -y remove ubuntu-desktop > apt-get -y purge lightdm > rm -Rf /var/lib/lightdm-data > apt-get -y remove --purge libreoffice-core > apt-get -y remove --purge libreoffice-common > > echo " Installing new kernel" > > apt-get -y install linux-generic-lts-vivid > apt-get -y autoremove linux-image-3.13.0-32-generic > apt-get -y autoremove linux-image-3.13.0-71-generic > update-grub > reboot > > After the reboot, a "uname -r" shows kernel 3.19.0-42-generic. > > Under this kernel I can now reliably reproduce the failure to stop a > MongoDB container. Specifically, any & all attempts to kill the container, > e.g., via > > Marathon HTTP Delete (which leads to mesos-docker-executor issuing a > "docker stop" command) > Getting inside the running container shell and issuing "kill" or > db.shutDown() > > cause the mongod container > > - to show in its log that it's shutting down normally > - to enter a 100% CPU loop > - to become unkillable (only reboot "fixes" things) > > Note finally that my conclusion about kernel 3.13 "working" is at present > a weak induction. But I do know that when I reverted to that kernel I > could, at least once, stop the containers w/o any problems; whereas at 3.19 > I can reliably reproduce the problem. I will try to make this induction > stronger as the day wears on. > > Did I do something "wrong" in my kernel upgrade steps? > > Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the > area of task termination & signal handling? > > Thanks for your help. > > -Paul > > > On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell <arach...@gmail.com> wrote: > >> I spoke too soon, I'm afraid. >> >> The next time I did the stop (with zero timeout), I saw the same phenomenon: >> a mongo container showing repeated: >> >> killing docker task >> shutting down >> >> >> What else can I try? >> >> Thank you. >> >> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell <arach...@gmail.com> wrote: >> >>> Hi Tim, >>> >>> I set docker_stop_timeout to zero as you asked. I am pleased to report >>> (though a bit fearful about being pleased) that this change seems to have >>> shut everyone down pretty much instantly. >>> >>> Can you explain what's happening, e.g., does docker_stop_timeout=0 >>> cause the immediate use of "kill -9" as opposed to "kill -2"? >>> >>> I will keep testing the behavior. >>> >>> Thank you. >>> >>> -Paul >>> >>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell <arach...@gmail.com> wrote: >>> >>>> Hi Tim, >>>> >>>> Things have gotten slightly odder (if that's possible). When I now >>>> start the application (5 or so containers), only one, "ecxconfigdb", gets >>>> started - and even he took a few tries. 
That is, I see him failing, moving >>>> to deploying, then starting again. But I've no evidence (no STDOUT, and no >>>> docker container logs) that shows why. >>>> >>>> In any event, ecxconfigdb does start. Happily, when I try to stop the >>>> application I am seeing the phenomena I posted before: "Killing docker task", >>>> "Shutting down" repeated many times. The UN-stopped container is now running >>>> at 100% CPU. >>>> >>>> I will try modifying docker_stop_timeout. Back shortly. >>>> >>>> Thanks again. >>>> >>>> -Paul >>>> >>>> PS: what do you make of the "broken pipe" error in the docker.log? >>>> >>>> *from /var/log/upstart/docker.log* >>>> >>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json >>>> INFO[3054] GET >>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >>>> ERRO[3054] Handler for GET >>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >>>> returned error: No such image: >>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >>>> ERRO[3054] HTTP Error >>>> err=No such image: >>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >>>> statusCode=404 >>>> INFO[3054] GET /v1.15/containers/weave/json >>>> INFO[3054] POST >>>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b >>>> INFO[3054] POST >>>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1&stdout=1&stream=1 >>>> INFO[3054] POST >>>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start >>>> INFO[3054] GET >>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >>>> INFO[3054] GET >>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >>>> INFO[3054] GET /v1.15/containers/weave/json >>>> INFO[3054] GET >>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >>>> INFO[3054] GET >>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >>>> INFO[3054] GET /v1.15/containers/weave/json >>>> INFO[3054] GET >>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >>>> INFO[3054] GET >>>> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json >>>> INFO[3111] GET /v1.21/containers/json >>>> INFO[3120] GET /v1.21/containers/cf7/json >>>> INFO[3120] GET >>>> /v1.21/containers/cf7/logs?stderr=1&stdout=1&tail=all >>>> INFO[3153] GET /containers/json >>>> INFO[3153] GET >>>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >>>> INFO[3153] GET >>>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >>>> INFO[3153] GET >>>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json >>>> INFO[3175] GET /containers/json >>>> INFO[3175] GET >>>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json >>>> INFO[3175] GET >>>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json >>>> INFO[3175] GET >>>> 
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json >>>> * INFO[3175] POST >>>> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop* >>>> ?t=15 >>>> * ERRO[3175] attach: stdout: write unix @: broken pipe * >>>> * INFO[3190] Container >>>> cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to >>>> exit within 15 seconds of SIGTERM - using the force * >>>> * INFO[3200] Container cf7fc7c48324 failed to exit within 10 >>>> seconds of kill - trying direct SIGKILL * >>>> >>>> *STDOUT from Mesos:* >>>> >>>> *--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b" >>>> *--docker="/usr/local/ecxmcc/weaveShim" --help="false" >>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" >>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" >>>> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" >>>> --stop_timeout="15secs" >>>> --container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b" >>>> --docker="/usr/local/ecxmcc/weaveShim" --help="false" >>>> --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" >>>> --mapped_directory="/mnt/mesos/sandbox" --quiet="false" >>>> --sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b" >>>> --stop_timeout="15secs" >>>> Registered docker executor on 71.100.202.99 >>>> Starting task ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285 >>>> 2016-01-14T20:45:38.613+0000 [initandlisten] MongoDB starting : pid=1 >>>> port=27017 dbpath=/data/db 64-bit host=ecxconfigdb >>>> 2016-01-14T20:45:38.614+0000 [initandlisten] db version v2.6.8 >>>> 2016-01-14T20:45:38.614+0000 [initandlisten] git version: >>>> 3abc04d6d4f71de00b57378e3277def8fd7a6700 >>>> 2016-01-14T20:45:38.614+0000 [initandlisten] build info: Linux >>>> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 >>>> 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49 >>>> 2016-01-14T20:45:38.614+0000 [initandlisten] allocator: tcmalloc >>>> 2016-01-14T20:45:38.614+0000 [initandlisten] options: { storage: { >>>> journal: { enabled: true } } } >>>> 2016-01-14T20:45:38.616+0000 [initandlisten] journal >>>> dir=/data/db/journal >>>> 2016-01-14T20:45:38.616+0000 [initandlisten] recover : no journal files >>>> present, no recovery needed >>>> 2016-01-14T20:45:39.006+0000 [initandlisten] waiting for connections on >>>> port 27017 >>>> 2016-01-14T20:46:38.975+0000 [clientcursormon] mem (MB) res:77 >>>> virt:12942 >>>> 2016-01-14T20:46:38.975+0000 [clientcursormon] mapped (incl journal >>>> view):12762 >>>> 2016-01-14T20:46:38.975+0000 [clientcursormon] connections:0 >>>> Killing docker task >>>> Shutting down >>>> Killing docker task >>>> Shutting down >>>> Killing docker task >>>> Shutting down >>>> >>>> On Thu, Jan 14, 2016 at 3:38 PM, Paul Bell <arach...@gmail.com> wrote: >>>> >>>>> Hey Tim, >>>>> >>>>> Thank you very much for your reply. >>>>> >>>>> Yes, I am in the midst of trying to reproduce the problem. If >>>>> successful (so to speak), I will do as you ask. 
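>>>>> (For the record, here is roughly how I plan to apply the change - a sketch only, assuming the flag can simply be changed on the slave's command line and the slave restarted, and that "0secs" parses the same way the current "15secs" does; every other flag stays exactly as in the ps output from my first mail:)
>>>>>
>>>>> /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos \
>>>>>     --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim \
>>>>>     --docker_stop_timeout=0secs   # remaining flags unchanged
>>>>>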
>>>>> >>>>> Cordially, >>>>> >>>>> Paul >>>>> >>>>> On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen <t...@mesosphere.io> wrote: >>>>> >>>>>> Hi Paul, >>>>>> >>>>>> Looks like we've already issued the docker stop as you've seen in the ps >>>>>> output, but the containers are still running. Can you look at the Docker >>>>>> daemon logs and see what's going on there? >>>>>> >>>>>> And can you also try to modify docker_stop_timeout to 0 so that >>>>>> we SIGKILL the containers right away, and see if this still happens? >>>>>> >>>>>> Tim >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell <arach...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> It's been quite some time since I've posted here and that's chiefly >>>>>>> because up until a day or two ago, things were working really well. >>>>>>> >>>>>>> I actually may have posted about this some time back. But then the >>>>>>> problem seemed more intermittent. >>>>>>> >>>>>>> In summa, several "docker stops" don't work, i.e., the containers >>>>>>> are not stopped. >>>>>>> >>>>>>> Deployment: >>>>>>> >>>>>>> one Ubuntu 14.04 LTS VM (VMware) with kernel 3.19 >>>>>>> Zookeeper >>>>>>> Mesos-master (0.23.0) >>>>>>> Mesos-slave (0.23.0) >>>>>>> Marathon (0.10.0) >>>>>>> Docker 1.9.1 >>>>>>> Weave 1.1.0 >>>>>>> Our application containers, which include >>>>>>> MongoDB (4) >>>>>>> PostGres >>>>>>> ECX (our product) >>>>>>> >>>>>>> The only thing that's changed at all in the config above is the >>>>>>> version of Docker. It used to be 1.6.2, but today I upgraded it hoping to >>>>>>> solve >>>>>>> the problem. >>>>>>> >>>>>>> >>>>>>> My automater program stops the application by sending Marathon an >>>>>>> "http delete" for each running app. Every now & then (reliably >>>>>>> reproducible >>>>>>> today) not all containers get stopped. Most recently, 3 containers >>>>>>> failed >>>>>>> to stop. >>>>>>> >>>>>>> Here are the attendant phenomena: >>>>>>> >>>>>>> Marathon shows the 3 applications in deployment mode (presumably >>>>>>> "deployment" in the sense of "stopping") >>>>>>> >>>>>>> *ps output:* >>>>>>> >>>>>>> root@71:~# ps -ef | grep docker >>>>>>> root 3823 1 0 13:55 ? 00:00:02 /usr/bin/docker >>>>>>> daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243 >>>>>>> root 4967 1 0 13:57 ? 00:00:01 >>>>>>> /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos >>>>>>> --log_dir=/var/log/mesos --containerizers=docker,mesos >>>>>>> --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=15secs >>>>>>> --executor_registration_timeout=5mins --hostname=71.100.202.99 >>>>>>> --ip=71.100.202.99 --attributes=hostType:ecx,shard1 >>>>>>> --resources=ports:[31000-31999,8443-8443] >>>>>>> root 5263 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>>>> tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 >>>>>>> -container-port 6783 >>>>>>> root 5271 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>>>> udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 >>>>>>> -container-port 6783 >>>>>>> root 5279 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>>>> tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 >>>>>>> -container-port 53 >>>>>>> root 5287 3823 0 13:57 ? 00:00:00 docker-proxy -proto >>>>>>> udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 >>>>>>> -container-port 53 >>>>>>> root 7119 4967 0 14:00 ? 
00:00:01 >>>>>>> mesos-docker-executor >>>>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 >>>>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false >>>>>>> --mapped_directory=/mnt/mesos/sandbox >>>>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 >>>>>>> --stop_timeout=15secs >>>>>>> root 7378 4967 0 14:00 ? 00:00:01 >>>>>>> mesos-docker-executor >>>>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >>>>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false >>>>>>> --mapped_directory=/mnt/mesos/sandbox >>>>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >>>>>>> --stop_timeout=15secs >>>>>>> root 7640 4967 0 14:01 ? 00:00:01 >>>>>>> mesos-docker-executor >>>>>>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 >>>>>>> --docker=/usr/local/ecxmcc/weaveShim --help=false >>>>>>> --mapped_directory=/mnt/mesos/sandbox >>>>>>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 >>>>>>> --stop_timeout=15secs >>>>>>> *root 9696 9695 0 14:06 ? 00:00:00 /usr/bin/docker >>>>>>> stop -t 15 >>>>>>> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298* >>>>>>> *root 9709 9708 0 14:06 ? 00:00:00 /usr/bin/docker >>>>>>> stop -t 15 >>>>>>> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89* >>>>>>> *root 9720 9719 0 14:06 ? 
00:00:00 /usr/bin/docker >>>>>>> stop -t 15 >>>>>>> mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2* >>>>>>> >>>>>>> *docker ps output:* >>>>>>> >>>>>>> root@71:~# docker ps >>>>>>> CONTAINER ID IMAGE COMMAND >>>>>>> CREATED STATUS PORTS >>>>>>> NAMES >>>>>>> 5abafbfe7de2 mongo:2.6.8 "/w/w >>>>>>> /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp >>>>>>> >>>>>>> >>>>>>> >>>>>>> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 >>>>>>> a8449682ca2e mongo:2.6.8 "/w/w >>>>>>> /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp >>>>>>> >>>>>>> >>>>>>> >>>>>>> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 >>>>>>> 3b956457374b mongo:2.6.8 "/w/w >>>>>>> /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp >>>>>>> >>>>>>> >>>>>>> >>>>>>> mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 >>>>>>> 4c1588bb3d4b weaveworks/weaveexec:v1.1.0 >>>>>>> "/home/weave/weavepro" 15 minutes ago Up 15 minutes >>>>>>> >>>>>>> weaveproxy >>>>>>> a26a0363584b weaveworks/weave:v1.1.0 >>>>>>> "/home/weave/weaver -" 15 minutes ago Up 15 minutes >>>>>>> 172.17.0.1:53->53/tcp, 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, >>>>>>> 0.0.0.0:6783->6783/udp weave >>>>>>> >>>>>>> *from /var/log/syslog:* >>>>>>> >>>>>>> >>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405 5002 >>>>>>> master.cpp:2944] Asked to kill task >>>>>>> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>>>> *Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459 5002 >>>>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 >>>>>>> at >>>>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) >>>>>>> to >>>>>>> kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167* >>>>>>> *Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729 5042 >>>>>>> slave.cpp:1755] Asked to kill task >>>>>>> mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000* >>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295 5004 >>>>>>> http.cpp:283] HTTP GET for /master/state.json from >>>>>>> 172.19.15.61:65038 with User-Agent='Mozilla/5.0 (Windows NT 6.1; >>>>>>> WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 >>>>>>> Safari/537.36' >>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904 5001 >>>>>>> master.cpp:2944] Asked to kill task >>>>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425935 5001 >>>>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 >>>>>>> at >>>>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task >>>>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167 >>>>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136 5041 >>>>>>> slave.cpp:1755] Asked to kill task >>>>>>> ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of 
framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932 4998 >>>>>>> master.cpp:2944] Asked to kill task >>>>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>>>> Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958 4998 >>>>>>> master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 >>>>>>> at >>>>>>> slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task >>>>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167 >>>>>>> Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151 5038 >>>>>>> slave.cpp:1755] Asked to kill task >>>>>>> ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 >>>>>>> Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009 5001 >>>>>>> master.cpp:4290] Sending 1 offers to framework >>>>>>> 20160114-103414-1674208327-5050-3293-0000 (marathon) at >>>>>>> scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167 >>>>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO >>>>>>> started processing 1 offers, launching at most 1 tasks per offer and >>>>>>> 1000 >>>>>>> tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132) >>>>>>> Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO >>>>>>> Offer [20160114-135722-1674208327-5050-4917-O128]. Decline with default >>>>>>> filter refuseSeconds (use --decline_offer_duration to configure) >>>>>>> (mesosphere.marathon.tasks.IterativeOfferMatcher$:231) >>>>>>> >>>>>>> >>>>>>> *from Mesos STDOUT of unstopped container:* >>>>>>> >>>>>>> Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 >>>>>>> 2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : >>>>>>> pid=1 port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig >>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8 >>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] git version: >>>>>>> 3abc04d6d4f71de00b57378e3277def8fd7a6700 >>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux >>>>>>> build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 >>>>>>> 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49 >>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc >>>>>>> 2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: { >>>>>>> clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", >>>>>>> journal: >>>>>>> { enabled: true } } } >>>>>>> 2016-01-14T19:01:10.999+0000 [initandlisten] journal >>>>>>> dir=/data/db/config/journal >>>>>>> 2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal >>>>>>> files present, no recovery needed >>>>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] warning: >>>>>>> ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { >>>>>>> opid: 11, active: true, secs_running: 0, microsecs_running: 36, op: >>>>>>> "query", ns: "local.oplog.$main", query: { query: {}, orderby: { >>>>>>> $natural: >>>>>>> -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId: >>>>>>> "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: >>>>>>> 0, >>>>>>> lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } } >>>>>>> 2016-01-14T19:01:11.429+0000 [initandlisten] waiting for 
connections >>>>>>> on port 27019 >>>>>>> 2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted >>>>>>> from 10.2.0.3:51189 #1 (1 connection now open) >>>>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted >>>>>>> from 10.2.0.3:51190 #2 (2 connections now open) >>>>>>> 2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted >>>>>>> from 10.2.0.3:51191 #3 (3 connections now open) >>>>>>> 2016-01-14T19:01:17.414+0000 [conn3] first cluster operation >>>>>>> detected, adding sharding hook to enable versioning and authentication >>>>>>> to >>>>>>> remote servers >>>>>>> 2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>>>> 2016-01-14T19:01:17.415+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>>>> 2016-01-14T19:01:17.416+0000 [conn3] CMD fsync: sync:1 lock:0 >>>>>>> 2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted >>>>>>> from 10.2.0.3:51193 #4 (4 connections now open) >>>>>>> 2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted >>>>>>> from 10.2.0.3:51194 #5 (5 connections now open) >>>>>>> 2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189 >>>>>>> (4 connections now open) >>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59 >>>>>>> virt:385 >>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] mapped (incl journal >>>>>>> view):192 >>>>>>> 2016-01-14T19:02:11.285+0000 [clientcursormon] connections:4 >>>>>>> 2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72 >>>>>>> virt:385 >>>>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon] mapped (incl journal >>>>>>> view):192 >>>>>>> 2016-01-14T19:03:11.294+0000 [clientcursormon] connections:4 >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> Shutting down >>>>>>> Killing docker task >>>>>>> >>>>>>> Most disturbing in all of this is that while I can stop the >>>>>>> deployments in Marathon (which properly ends the "docker stop" commands >>>>>>> visible in ps output), I cannot bounce Docker, neither via Upstart nor >>>>>>> via the kill command. Ultimately, I have to reboot the VM. >>>>>>> >>>>>>> FWIW, the 3 mongod containers (apparently stuck in their "Killing >>>>>>> docker task"/"Shutting down" loop) are running at 100% CPU as evinced by >>>>>>> both >>>>>>> "docker stats" and "top". >>>>>>> >>>>>>> I would truly be grateful for some guidance on this - even a mere >>>>>>> work-around would be appreciated. >>>>>>> >>>>>>> Thank you. >>>>>>> >>>>>>> -Paul >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >
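PS (adding this below the quoted thread for context): if the hang shows up again before I finish reverting kernels, the next thing I'd capture is where the spinning process is actually stuck. Something along these lines - only a sketch, where <container-id> stands for whatever "docker ps" shows for the hung mongo container:

PID=$(docker inspect --format '{{.State.Pid}}' <container-id>)   # container's init PID on the host
grep State /proc/$PID/status     # R = running/spinning, D = uninterruptible sleep
cat /proc/$PID/stack             # kernel-side stack of that task (needs root)
dmesg | tail -50                 # any hung-task warnings or oopses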