Hi All,

It's been quite some time since I've posted here, and that's chiefly because, up until a day or two ago, things were working really well.
I actually may have posted about this some time back, but the problem then seemed more intermittent. In short: several "docker stop" commands don't work, i.e., the containers are not stopped.

Deployment (one Ubuntu 14.04 LTS VM under VMware, kernel 3.19):
- ZooKeeper
- Mesos master (0.23.0)
- Mesos slave (0.23.0)
- Marathon (0.10.0)
- Docker 1.9.1
- Weave 1.1.0

Our application's containers include MongoDB (4), Postgres, and ECX (our product).

The only thing that's changed at all in the config above is the version of Docker: it used to be 1.6.2, but I upgraded it today hoping to solve the problem.

My automater program stops the application by sending Marathon an HTTP DELETE for each running app. Every now and then (reliably reproducible today) not all containers get stopped. Most recently, 3 containers failed to stop. Here are the attendant phenomena.

Marathon shows the 3 applications in deployment mode (presumably "deployment" in the sense of "stopping").

*ps output:*

root@71:~# ps -ef | grep docker
root 3823 1 0 13:55 ? 00:00:02 /usr/bin/docker daemon -H unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
root 4967 1 0 13:57 ? 00:00:01 /usr/sbin/mesos-slave --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim --docker_stop_timeout=15secs --executor_registration_timeout=5mins --hostname=71.100.202.99 --ip=71.100.202.99 --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
root 5263 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5271 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port 6783
root 5279 3823 0 13:57 ? 00:00:00 docker-proxy -proto tcp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 5287 3823 0 13:57 ? 00:00:00 docker-proxy -proto udp -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port 53
root 7119 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2 --stop_timeout=15secs
root 7378 4967 0 14:00 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89 --stop_timeout=15secs
root 7640 4967 0 14:01 ? 00:00:01 mesos-docker-executor --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298 --docker=/usr/local/ecxmcc/weaveShim --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-0000/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298 --stop_timeout=15secs
*root 9696 9695 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
*root 9709 9708 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
*root 9720 9719 0 14:06 ? 00:00:00 /usr/bin/docker stop -t 15 mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*

*docker ps output:*

root@71:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5abafbfe7de2 mongo:2.6.8 "/w/w /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
a8449682ca2e mongo:2.6.8 "/w/w /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
3b956457374b mongo:2.6.8 "/w/w /entrypoint.sh " 11 minutes ago Up 11 minutes 27017/tcp mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
4c1588bb3d4b weaveworks/weaveexec:v1.1.0 "/home/weave/weavepro" 15 minutes ago Up 15 minutes weaveproxy
a26a0363584b weaveworks/weave:v1.1.0 "/home/weave/weaver -" 15 minutes ago Up 15 minutes 172.17.0.1:53->53/tcp, 172.17.0.1:53->53/udp, 0.0.0.0:6783->6783/tcp, 0.0.0.0:6783->6783/udp weave

*from /var/log/syslog:*

Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356405 5002 master.cpp:2944] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
*Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.356459 5002 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167*
*Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.356729 5042 slave.cpp:1755] Asked to kill task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000*
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.378295 5004 http.cpp:283] HTTP GET for /master/state.json from 172.19.15.61:65038 with User-Agent='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425904 5001 master.cpp:2944] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.425935 5001 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.426136 5041 slave.cpp:1755] Asked to kill task ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435932 4998 master.cpp:2944] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:02 71 mesos-master[4917]: I0114 14:10:02.435958 4998 master.cpp:3034] Telling slave 20160114-135722-1674208327-5050-4917-S0 at slave(1)@71.100.202.99:5051 (71.100.202.99) to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
Jan 14 14:10:02 71 mesos-slave[4967]: I0114 14:10:02.436151 5038 slave.cpp:1755] Asked to kill task ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9 of framework 20160114-103414-1674208327-5050-3293-0000
Jan 14 14:10:03 71 mesos-master[4917]: I0114 14:10:03.759009 5001 master.cpp:4290] Sending 1 offers to framework 20160114-103414-1674208327-5050-3293-0000 (marathon) at
scheduler-2a46bd53-a6fd-4202-b271-d633064acf7f@71.100.202.99:46167
Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,765] INFO started processing 1 offers, launching at most 1 tasks per offer and 1000 tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:132)
Jan 14 14:10:03 71 marathon[4937]: [2016-01-14 14:10:03,766] INFO Offer [20160114-135722-1674208327-5050-4917-O128]. Decline with default filter refuseSeconds (use --decline_offer_duration to configure) (mesosphere.marathon.tasks.IterativeOfferMatcher$:231)

*from Mesos STDOUT of an unstopped container:*

Starting task mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9
2016-01-14T19:01:10.997+0000 [initandlisten] MongoDB starting : pid=1 port=27019 dbpath=/data/db/config master=1 64-bit host=mongoconfig
2016-01-14T19:01:10.998+0000 [initandlisten] db version v2.6.8
2016-01-14T19:01:10.998+0000 [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700
2016-01-14T19:01:10.998+0000 [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2016-01-14T19:01:10.998+0000 [initandlisten] allocator: tcmalloc
2016-01-14T19:01:10.998+0000 [initandlisten] options: { sharding: { clusterRole: "configsvr" }, storage: { dbPath: "/data/db/config", journal: { enabled: true } } }
2016-01-14T19:01:10.999+0000 [initandlisten] journal dir=/data/db/config/journal
2016-01-14T19:01:11.000+0000 [initandlisten] recover : no journal files present, no recovery needed
2016-01-14T19:01:11.429+0000 [initandlisten] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { opid: 11, active: true, secs_running: 0, microsecs_running: 36, op: "query", ns: "local.oplog.$main", query: { query: {}, orderby: { $natural: -1 } }, client: "0.0.0.0:0", desc: "initandlisten", threadId: "0x7f8f73075b40", locks: { ^: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: {} } }
2016-01-14T19:01:11.429+0000 [initandlisten] waiting for connections on port 27019
2016-01-14T19:01:17.405+0000 [initandlisten] connection accepted from 10.2.0.3:51189 #1 (1 connection now open)
2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51190 #2 (2 connections now open)
2016-01-14T19:01:17.413+0000 [initandlisten] connection accepted from 10.2.0.3:51191 #3 (3 connections now open)
2016-01-14T19:01:17.414+0000 [conn3] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
2016-01-14T19:01:17.414+0000 [conn3] CMD fsync: sync:1 lock:0
(the "CMD fsync: sync:1 lock:0" line appears seven times in total, 19:01:17.414 through 19:01:17.416)
2016-01-14T19:01:17.419+0000 [initandlisten] connection accepted from 10.2.0.3:51193 #4 (4 connections now open)
2016-01-14T19:01:17.420+0000 [initandlisten] connection accepted from 10.2.0.3:51194 #5 (5 connections now open)
2016-01-14T19:01:17.442+0000 [conn1] end connection 10.2.0.3:51189 (4 connections now open)
2016-01-14T19:02:11.285+0000 [clientcursormon] mem (MB) res:59 virt:385
2016-01-14T19:02:11.285+0000 [clientcursormon] mapped (incl journal view):192
2016-01-14T19:02:11.285+0000 [clientcursormon] connections:4
2016-01-14T19:03:11.293+0000 [clientcursormon] mem (MB) res:72 virt:385
2016-01-14T19:03:11.294+0000 [clientcursormon] mapped (incl journal view):192
2016-01-14T19:03:11.294+0000 [clientcursormon] connections:4
Killing docker task
Shutting down
Killing docker task
Shutting down
(this "Killing docker task" / "Shutting down" pair repeats about fifteen times, and the output ends mid-pair on a final "Killing docker task")

Most disturbing in all of this: while I can stop the deployments in Marathon (which properly ends the "docker stop" commands visible in the ps output), I cannot bounce Docker itself, neither via Upstart nor via the kill command. Ultimately, I have to reboot the VM.

FWIW, the 3 mongod containers (apparently stuck in their "Killing docker task" / "Shutting down" loop) are running at 100% CPU, as evinced by both "docker stats" and "top".

I would truly be grateful for some guidance on this; even a mere work-around would be appreciated.

Thank you.
-Paul
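P.S. For completeness, the HTTP DELETE my automater sends is just Marathon's standard app-removal endpoint (DELETE /v2/apps/<appId>, which destroys the app and asks Mesos to kill its tasks). A minimal sketch of what it does per app; the Marathon address and app ID below are illustrative, not my exact config:

```shell
# Build the Marathon app-removal URL for a given Marathon host:port and app ID.
marathon_delete_url() {
  local marathon_host="$1" app_id="$2"
  echo "http://${marathon_host}/v2/apps/${app_id}"
}

# What the automater effectively runs for each app (commented out here,
# since it needs a live Marathon):
#   curl -X DELETE "$(marathon_delete_url 71.100.202.99:8080 ecxconfigdb)"

marathon_delete_url 71.100.202.99:8080 ecxconfigdb
# prints: http://71.100.202.99:8080/v2/apps/ecxconfigdb
```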
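P.P.S. Since "docker stop -t 15" sends SIGTERM and escalates to SIGKILL after 15 seconds, a container that survives it (and a daemon that can't be bounced short of a reboot) suggests a process wedged in the kernel. Next time this happens I plan to check the stuck container's process state directly, bypassing the daemon; a sketch, with the docker-specific steps shown as comments and the /proc check demonstrated on the current shell's own PID:

```shell
# For a stuck container, resolve its init PID as seen from the host:
#   cid=3b956457374b                                  # one of the stuck mongos above
#   pid=$(docker inspect --format '{{.State.Pid}}' "$cid")
#
# A state of "D" (uninterruptible sleep) in /proc/<pid>/status would explain
# why even SIGKILL has no effect. Demonstrated here on this shell's PID:
pid=$$
grep '^State:' "/proc/$pid/status"

# Last resort, bypassing the Docker daemon entirely:
#   kill -KILL "$pid"
```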