Re: Help needed (alas, urgently)

2016-01-15 Thread Paul Bell
In chasing down this problem, I stumbled upon something of moment: the
problem does NOT seem to happen with kernel 3.13.

Some weeks back, in the hope of getting past another problem wherein the
root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
LTS). The kernel upgrade was done as shown here (there's some extra stuff
to get rid of Ubuntu desktop and liberate some disk space):

  apt-get update
  apt-get -y remove ubuntu-desktop
  apt-get -y purge lightdm
  rm -Rf /var/lib/lightdm-data
  apt-get -y remove --purge libreoffice-core
  apt-get -y remove --purge libreoffice-common

  echo "  Installing new kernel"

  apt-get -y install linux-generic-lts-vivid
  apt-get -y autoremove linux-image-3.13.0-32-generic
  apt-get -y autoremove linux-image-3.13.0-71-generic
  update-grub
  reboot

After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.

Under this kernel I can now reliably reproduce the failure to stop a
MongoDB container. Specifically, any & all attempts to kill the container,
e.g., via

   - a Marathon HTTP DELETE (which leads to mesos-docker-executor issuing a
   "docker stop" command)
   - getting inside the running container's shell and issuing "kill" or
   db.shutdownServer() (sketched below)

cause the mongod container

   - to show in its log that it's shutting down normally
   - to enter a 100% CPU loop
   - to become unkillable (only a reboot "fixes" things)
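
For the record, the second approach looks roughly like this (container ID is
illustrative, lifted from a later "docker ps"; it assumes the mongo shell and
pidof are present in the mongo:2.6.8 image):

  # ask mongod to shut down cleanly via the mongo shell
  docker exec -it 5abafbfe7de2 mongo admin --eval "db.shutdownServer()"

  # or signal mongod directly from a shell inside the container
  docker exec -it 5abafbfe7de2 bash -c 'kill -2 $(pidof mongod)'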

Note finally that my conclusion about kernel 3.13 "working" is at present a
weak induction. But I do know that when I reverted to that kernel I could,
at least once, stop the containers w/o any problems; whereas at 3.19 I can
reliably reproduce the problem. I will try to make this induction stronger
as the day wears on.
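
For anyone repeating the revert, what I'm doing is roughly this (the 3.13
images were autoremoved by the upgrade script above, so one has to come back
first; package and GRUB entry names are assumptions - pick whatever matches
your system):

  # reinstall a 3.13 kernel
  apt-get -y install linux-image-3.13.0-71-generic linux-image-extra-3.13.0-71-generic
  update-grub

  # boot it on the next restart only (needs GRUB_DEFAULT=saved; otherwise
  # just pick the 3.13 entry from the GRUB menu at boot)
  grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 3.13.0-71-generic"
  reboot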

Did I do something "wrong" in my kernel upgrade steps?

Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
area of task termination & signal handling?

Thanks for your help.

-Paul


On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell  wrote:

> I spoke too soon, I'm afraid.
>
> The next time I did the stop (with zero timeout), I see the same phenomenon: a
> mongo container repeatedly showing:
>
> killing docker task
> shutting down
>
>
> What else can I try?
>
> Thank you.
>
> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:
>
>> Hi Tim,
>>
>> I set docker_stop_timeout to zero as you asked. I am pleased to report
>> (though a bit fearful about being pleased) that this change seems to have
>> shut everyone down pretty much instantly.
>>
>> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
>> the immediate use of "kill -9" as opposed to "kill -2"?
>>
>> I will keep testing the behavior.
>>
>> Thank you.
>>
>> -Paul
>>
>> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:
>>
>>> Hi Tim,
>>>
>>> Things have gotten slightly odder (if that's possible). When I now start
>>> the application's 5 or so containers, only one, "ecxconfigdb", gets started -
>>> and even it took a few tries. That is, I see it failing, moving to
>>> deploying, then starting again. But I have no evidence (no STDOUT, and no
>>> docker container logs) that shows why.
>>>
>>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>>> application I am seeing the phenomena I posted before: killing docker task,
>>> shutting down repeated many times. The UN-stopped container is now running
>>> at 100% CPU.
>>>
>>> I will try modifying docker_stop_timeout. Back shortly
>>>
>>> Thanks again.
>>>
>>> -Paul
>>>
>>> PS: what do you make of the "broken pipe" error in the docker.log?
>>>
>>> *from /var/log/upstart/docker.log*
>>>
>>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>>> INFO[3054] GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> ERRO[3054] Handler for GET
>>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>>> returned error: No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> ERRO[3054] HTTP Error
>>> err=No such image:
>>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> statusCode=404
>>> INFO[3054] GET /v1.15/containers/weave/json
>>> INFO[3054] POST
>>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>>> INFO[3054] POST
>>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
>>> INFO[3054] POST
>>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>>> INFO[3054] GET
>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET
>>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>>> INFO[3054] GET 

Re: Help needed (alas, urgently)

2016-01-15 Thread Tim Chen
Hi Paul,

No problem, I haven't even spent much time on it yet, and I'm glad you resolved
the problem yourself.

We always welcome people doing this :)

Tim

On Fri, Jan 15, 2016 at 10:48 AM, Paul Bell  wrote:

> Tim,
>
> I've tracked down the cause of this problem: it's the result of some kind
> of incompatibility between kernel 3.19 and "VMware Tools". I know little
> more than that.
>
> I installed VMware Tools via *apt-get install open-vm-tools-lts-trusty*.
> Everything worked fine on 3.13. But when I upgrade to 3.19, the error
> occurs quite reliably. Revert back to 3.13 and the error goes away.
>
> I looked high & low for some statement of kernel requirements for VMware
> Tools, but can find none.
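>
> If anyone wants to test the same hypothesis, the check is roughly this
> (package names as installed above; plain open-vm-tools is removed too in
> case it was pulled in alongside):
>
>   apt-get -y remove open-vm-tools-lts-trusty open-vm-tools
>   reboot
>   # ...then retry the "docker stop" under 3.19 with the tools gone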
>
> Sorry to have wasted your time.
>
> -Paul
>
> On Fri, Jan 15, 2016 at 9:19 AM, Paul Bell  wrote:
>
>> In chasing down this problem, I stumbled upon something of moment: the
>> problem does NOT seem to happen with kernel 3.13.
>>
>> Some weeks back, in the hope of getting past another problem wherein the
>> root filesystem "becomes" R/O, I upgraded from 3.13 to 3.19 (Ubuntu 14.04
>> LTS). The kernel upgrade was done as shown here (there's some extra stuff
>> to get rid of Ubuntu desktop and liberate some disk space):
>>
>>   apt-get update
>>   apt-get -y remove ubuntu-desktop
>>   apt-get -y purge lightdm
>>   rm -Rf /var/lib/lightdm-data
>>   apt-get -y remove --purge libreoffice-core
>>   apt-get -y remove --purge libreoffice-common
>>
>>   echo "  Installing new kernel"
>>
>>   apt-get -y install linux-generic-lts-vivid
>>   apt-get -y autoremove linux-image-3.13.0-32-generic
>>   apt-get -y autoremove linux-image-3.13.0-71-generic
>>   update-grub
>>   reboot
>>
>> After the reboot, a "uname -r" shows kernel 3.19.0-42-generic.
>>
>> Under this kernel I can now reliably reproduce the failure to stop a
>> MongoDB container. Specifically, any & all attempts to kill the container,
>> e.g., via
>>
>>- a Marathon HTTP DELETE (which leads to mesos-docker-executor issuing a
>>"docker stop" command)
>>- getting inside the running container's shell and issuing "kill" or
>>db.shutdownServer()
>>
>> cause the mongod container
>>
>>- to show in its log that it's shutting down normally
>>- to enter a 100% CPU loop
>>- to become unkillable (only a reboot "fixes" things)
>>
>> Note finally that my conclusion about kernel 3.13 "working" is at present
>> a weak induction. But I do know that when I reverted to that kernel I
>> could, at least once, stop the containers w/o any problems; whereas at 3.19
>> I can reliably reproduce the problem. I will try to make this induction
>> stronger as the day wears on.
>>
>> Did I do something "wrong" in my kernel upgrade steps?
>>
>> Is anyone aware of such an issue in 3.19 or of work done post-3.13 in the
>> area of task termination & signal handling?
>>
>> Thanks for your help.
>>
>> -Paul
>>
>>
>> On Thu, Jan 14, 2016 at 5:14 PM, Paul Bell  wrote:
>>
>>> I spoke too soon, I'm afraid.
>>>
>>> The next time I did the stop (with zero timeout), I see the same phenomenon:
>>> a mongo container repeatedly showing:
>>>
>>> killing docker task
>>> shutting down
>>>
>>>
>>> What else can I try?
>>>
>>> Thank you.
>>>
>>> On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:
>>>
 Hi Tim,

 I set docker_stop_timeout to zero as you asked. I am pleased to report
 (though a bit fearful about being pleased) that this change seems to have
 shut everyone down pretty much instantly.

 Can you explain what's happening, e.g., does docker_stop_timeout=0
 cause the immediate use of "kill -9" as opposed to "kill -2"?

 I will keep testing the behavior.

 Thank you.

 -Paul

 On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:

> Hi Tim,
>
> Things have gotten slightly odder (if that's possible). When I now
> start the application's 5 or so containers, only one, "ecxconfigdb", gets
> started - and even it took a few tries. That is, I see it failing, moving
> to deploying, then starting again. But I have no evidence (no STDOUT, and no
> docker container logs) that shows why.
>
> In any event, ecxconfigdb does start. Happily, when I try to stop the
> application I am seeing the phenomena I posted before: killing docker 
> task,
> shutting down repeated many times. The UN-stopped container is now running
> at 100% CPU.
>
> I will try modifying docker_stop_timeout. Back shortly
>
> Thanks again.
>
> -Paul
>
> PS: what do you make of the "broken pipe" error in the docker.log?
>
> *from /var/log/upstart/docker.log*
>
> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
> INFO[3054] GET
> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
> ERRO[3054] Handler for GET
> 

Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hi All,

It's been quite some time since I've posted here and that's chiefly because
up until a day or two ago, things were working really well.

I actually may have posted about this some time back. But then the problem
seemed more intermittent.

In summa, several "docker stops" don't work, i.e., the containers are not
stopped.

Deployment:

one Ubuntu VM (VMware) LTS 14.04 with kernel 3.19
Zookeeper
Mesos-master (0.23.0)
Mesos-slave (0.23.0)
Marathon (0.10.0)
Docker 1.9.1
Weave 1.1.0
Our application containers, which include:
MongoDB (4)
Postgres
ECX (our product)

The only thing that's changed at all in the config above is the version of
Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
problem.


My automater program stops the application by sending Marathon an "http
delete" for each running app. Every now & then (reliably reproducible today)
not all containers get stopped. Most recently, 3 containers failed to stop.
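
Per app, that delete amounts to something like this (host/port and app ID are
illustrative):

  # Marathon REST API: DELETE /v2/apps/{appId} destroys the app and kills its tasks
  curl -X DELETE http://localhost:8080/v2/apps/ecxconfigdb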

Here are the attendant phenomena:

Marathon shows the 3 applications in deployment mode (presumably
"deployment" in the sense of "stopping")

*ps output:*

root@71:~# ps -ef | grep docker
root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
--master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
--containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
--docker_stop_timeout=15secs --executor_registration_timeout=5mins
--hostname=71.100.202.99 --ip=71.100.202.99
--attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
-host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
6783
root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
-host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
6783
root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
-host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
53
root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
-host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
53
root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
--container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
--stop_timeout=15secs
root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
--container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
--stop_timeout=15secs
root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
--container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
--docker=/usr/local/ecxmcc/weaveShim --help=false
--mapped_directory=/mnt/mesos/sandbox
--sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
--stop_timeout=15secs
*root  9696  9695  0 14:06 ?00:00:00 /usr/bin/docker stop -t 15
mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
*root  9709  9708  0 14:06 ?00:00:00 /usr/bin/docker stop -t 15
mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
*root  9720  9719  0 14:06 ?00:00:00 /usr/bin/docker stop -t 15
mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*

*docker ps output:*

root@71:~# docker ps
CONTAINER ID   IMAGE         COMMAND                  CREATED          STATUS          PORTS       NAMES
5abafbfe7de2   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago   Up 11 minutes   27017/tcp   mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
a8449682ca2e   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago   Up 11 minutes   27017/tcp   mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
3b956457374b   mongo:2.6.8   "/w/w /entrypoint.sh "   11 minutes ago   Up 11 

Re: Help needed (alas, urgently)

2016-01-14 Thread Tim Chen
Hi Paul,

Looks like we've already issued the docker stop, as you've seen in the ps
output, but the containers are still running. Can you look at the Docker
daemon logs and see what's going on there?

And can you also try to modify docker_stop_timeout to 0 so that we
SIGKILL the containers right away, and see if this still happens?
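
On an Ubuntu/upstart agent that would look something like this, assuming the
Mesosphere packages (which read one-file-per-flag from /etc/mesos-slave);
otherwise just change the --docker_stop_timeout flag wherever mesos-slave is
launched:

  echo '0secs' > /etc/mesos-slave/docker_stop_timeout
  restart mesos-slave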

Tim



On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell  wrote:

> Hi All,
>
> It's been quite some time since I've posted here and that's chiefly
> because up until a day or two ago, things were working really well.
>
> I actually may have posted about this some time back. But then the problem
> seemed more intermittent.
>
> In summa, several "docker stops" don't work, i.e., the containers are not
> stopped.
>
> Deployment:
>
> one Ubuntu VM (VMware) LTS 14.04 with kernel 3.19
> Zookeeper
> Mesos-master (0.23.0)
> Mesos-slave (0.23.0)
> Marathon (0.10.0)
> Docker 1.9.1
> Weave 1.1.0
> Our application containers, which include:
> MongoDB (4)
> Postgres
> ECX (our product)
>
> The only thing that's changed at all in the config above is the version of
> Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
> problem.
>
>
> My automater program stops the application by sending Marathon an "http
> delete" for each running app. Every now & then (reliably reproducible today)
> not all containers get stopped. Most recently, 3 containers failed to stop.
>
> Here are the attendant phenomena:
>
> Marathon shows the 3 applications in deployment mode (presumably
> "deployment" in the sense of "stopping")
>
> *ps output:*
>
> root@71:~# ps -ef | grep docker
> root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
> root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
> --hostname=71.100.202.99 --ip=71.100.202.99
> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
> root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
> 6783
> root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
> 53
> root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
> --stop_timeout=15secs
> root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
> --stop_timeout=15secs
> root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
> --docker=/usr/local/ecxmcc/weaveShim --help=false
> --mapped_directory=/mnt/mesos/sandbox
> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
> --stop_timeout=15secs
> *root  9696  9695  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
> *root  9709  9708  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89*
> *root  9720  9719  0 14:06 ?00:00:00 /usr/bin/docker stop -t
> 15
> mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2*
>
> *docker ps output:*
>
> root@71:~# docker ps
> CONTAINER ID   IMAGE         COMMAND                  CREATED          STATUS

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hi Tim,

Things have gotten slightly odder (if that's possible). When I now start
the application's 5 or so containers, only one, "ecxconfigdb", gets started -
and even it took a few tries. That is, I see it failing, moving to
deploying, then starting again. But I have no evidence (no STDOUT, and no
docker container logs) that shows why.
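
The places I'm looking, for what it's worth (container ID and sandbox glob are
illustrative):

  # containers that already exited, and whatever they logged
  docker ps -a
  docker logs cf7

  # the executor sandbox for the task (pattern from the slave's --sandbox_directory)
  cd /tmp/mesos/slaves/*/frameworks/*/executors/ecxconfigdb.*/runs/latest
  cat stdout stderr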

In any event, ecxconfigdb does start. Happily, when I try to stop the
application I am seeing the phenomena I posted before: killing docker task,
shutting down repeated many times. The UN-stopped container is now running
at 100% CPU.

I will try modifying docker_stop_timeout. Back shortly

Thanks again.

-Paul

PS: what do you make of the "broken pipe" error in the docker.log?

*from /var/log/upstart/docker.log*

INFO[3054] GET /v1.15/images/mongo:2.6.8/json
INFO[3054] GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
ERRO[3054] Handler for GET
/v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
returned error: No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
ERRO[3054] HTTP Error
 err=No such image:
mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
statusCode=404
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] POST
/v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
INFO[3054] POST
/v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET /v1.15/containers/weave/json
INFO[3054] GET
/v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3054] GET
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
INFO[3111] GET /v1.21/containers/json
INFO[3120] GET /v1.21/containers/cf7/json
INFO[3120] GET
/v1.21/containers/cf7/logs?stderr=1=1=all
INFO[3153] GET /containers/json
INFO[3153] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3153] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3153] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
INFO[3175] GET /containers/json
INFO[3175] GET
/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
INFO[3175] GET
/containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
INFO[3175] GET
/containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
*INFO[3175] POST
/v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop*
?t=15
*ERRO[3175] attach: stdout: write unix @: broken pipe*
*INFO[3190] Container
cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to
exit within 15 seconds of SIGTERM - using the force *
*INFO[3200] Container cf7fc7c48324 failed to exit within 10
seconds of kill - trying direct SIGKILL *
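
At that point a by-hand check from the host looks like this for me (container
ID abbreviated the way Docker prints it):

  # the daemon already tried SIGTERM then SIGKILL; try once more directly
  docker kill cf7fc7c48324
  docker ps | grep cf7fc7c48324

  # and check whether mongod is wedged in the kernel (state D in the STAT column)
  ps -eo pid,stat,wchan:30,args | grep [m]ongod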

*STDOUT from Mesos:*

*--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
*--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox" --quiet="false"
--sandbox_directory="/tmp/mesos/slaves/20160114-153418-1674208327-5050-3798-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.c3cae92e-baff-11e5-8afe-82f779ac6285/runs/c5c35d59-1318-4a96-b850-b0b788815f1b"
--stop_timeout="15secs"
--container="mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b"
--docker="/usr/local/ecxmcc/weaveShim" --help="false"
--initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO"
--mapped_directory="/mnt/mesos/sandbox" --quiet="false"

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
Hey Tim,

Thank you very much for your reply.

Yes, I am in the midst of trying to reproduce the problem. If successful
(so to speak), I will do as you ask.

Cordially,

Paul

On Thu, Jan 14, 2016 at 3:19 PM, Tim Chen  wrote:

> Hi Paul,
>
> Looks like we've already issued the docker stop, as you've seen in the ps
> output, but the containers are still running. Can you look at the Docker
> daemon logs and see what's going on there?
>
> And can you also try to modify docker_stop_timeout to 0 so that we
> SIGKILL the containers right away, and see if this still happens?
>
> Tim
>
>
>
> On Thu, Jan 14, 2016 at 11:52 AM, Paul Bell  wrote:
>
>> Hi All,
>>
>> It's been quite some time since I've posted here and that's chiefly
>> because up until a day or two ago, things were working really well.
>>
>> I actually may have posted about this some time back. But then the
>> problem seemed more intermittent.
>>
>> In summa, several "docker stops" don't work, i.e., the containers are not
>> stopped.
>>
>> Deployment:
>>
>> one Ubuntu VM (VMware) LTS 14.04 with kernel 3.19
>> Zookeeper
>> Mesos-master (0.23.0)
>> Mesos-slave (0.23.0)
>> Marathon (0.10.0)
>> Docker 1.9.1
>> Weave 1.1.0
>> Our application containers, which include:
>> MongoDB (4)
>> Postgres
>> ECX (our product)
>>
>> The only thing that's changed at all in the config above is the version
>> of Docker. Used to be 1.6.2 but I today upgraded it hoping to solve the
>> problem.
>>
>>
>> My automater program stops the application by sending Marathon an "http
>> delete" for each running app. Every now & then (reliably reproducible today)
>> not all containers get stopped. Most recently, 3 containers failed to stop.
>>
>> Here are the attendant phenomena:
>>
>> Marathon shows the 3 applications in deployment mode (presumably
>> "deployment" in the sense of "stopping")
>>
>> *ps output:*
>>
>> root@71:~# ps -ef | grep docker
>> root  3823 1  0 13:55 ?00:00:02 /usr/bin/docker daemon -H
>> unix:///var/run/docker.sock -H tcp://0.0.0.0:4243
>> root  4967 1  0 13:57 ?00:00:01 /usr/sbin/mesos-slave
>> --master=zk://71.100.202.99:2181/mesos --log_dir=/var/log/mesos
>> --containerizers=docker,mesos --docker=/usr/local/ecxmcc/weaveShim
>> --docker_stop_timeout=15secs --executor_registration_timeout=5mins
>> --hostname=71.100.202.99 --ip=71.100.202.99
>> --attributes=hostType:ecx,shard1 --resources=ports:[31000-31999,8443-8443]
>> root  5263  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5271  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 0.0.0.0 -host-port 6783 -container-ip 172.17.0.2 -container-port
>> 6783
>> root  5279  3823  0 13:57 ?00:00:00 docker-proxy -proto tcp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  5287  3823  0 13:57 ?00:00:00 docker-proxy -proto udp
>> -host-ip 172.17.0.1 -host-port 53 -container-ip 172.17.0.2 -container-port
>> 53
>> root  7119  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.bfc5a419-30f8-43f7-af2f-5582394532f2
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxconfigdb.1e6e0779-baf1-11e5-8c36-522bd4cc5ea9/runs/bfc5a419-30f8-43f7-af2f-5582394532f2
>> --stop_timeout=15secs
>> root  7378  4967  0 14:00 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/ecxcatalogdbs1.25911dda-baf1-11e5-8c36-522bd4cc5ea9/runs/9b700cdc-3d29-49b7-a7fc-e543a91f7b89
>> --stop_timeout=15secs
>> root  7640  4967  0 14:01 ?00:00:01 mesos-docker-executor
>> --container=mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298
>> --docker=/usr/local/ecxmcc/weaveShim --help=false
>> --mapped_directory=/mnt/mesos/sandbox
>> --sandbox_directory=/tmp/mesos/slaves/20160114-135722-1674208327-5050-4917-S0/frameworks/20160114-103414-1674208327-5050-3293-/executors/mongoconfig.2cb9163b-baf1-11e5-8c36-522bd4cc5ea9/runs/d7d861d3-cfc9-424d-b341-0631edea4298
>> --stop_timeout=15secs
>> *root  9696  9695  0 14:06 ?00:00:00 /usr/bin/docker stop -t
>> 15
>> mesos-20160114-135722-1674208327-5050-4917-S0.d7d861d3-cfc9-424d-b341-0631edea4298*
>> *root  9709  9708  0 14:06 ?00:00:00 /usr/bin/docker stop -t
>> 15
>> 

Re: Help needed (alas, urgently)

2016-01-14 Thread Paul Bell
I spoke too soon, I'm afraid.

The next time I did the stop (with zero timeout), I see the same phenomenon: a
mongo container repeatedly showing:

killing docker task
shutting down


What else can I try?

Thank you.

On Thu, Jan 14, 2016 at 5:07 PM, Paul Bell  wrote:

> Hi Tim,
>
> I set docker_stop_timeout to zero as you asked. I am pleased to report
> (though a bit fearful about being pleased) that this change seems to have
> shut everyone down pretty much instantly.
>
> Can you explain what's happening, e.g., does docker_stop_timeout=0 cause
> the immediate use of "kill -9" as opposed to "kill -2"?
>
> I will keep testing the behavior.
>
> Thank you.
>
> -Paul
>
> On Thu, Jan 14, 2016 at 3:59 PM, Paul Bell  wrote:
>
>> Hi Tim,
>>
>> Things have gotten slightly odder (if that's possible). When I now start
>> the application's 5 or so containers, only one, "ecxconfigdb", gets started -
>> and even it took a few tries. That is, I see it failing, moving to
>> deploying, then starting again. But I have no evidence (no STDOUT, and no
>> docker container logs) that shows why.
>>
>> In any event, ecxconfigdb does start. Happily, when I try to stop the
>> application I am seeing the phenomena I posted before: killing docker task,
>> shutting down repeated many times. The UN-stopped container is now running
>> at 100% CPU.
>>
>> I will try modifying docker_stop_timeout. Back shortly
>>
>> Thanks again.
>>
>> -Paul
>>
>> PS: what do you make of the "broken pipe" error in the docker.log?
>>
>> *from /var/log/upstart/docker.log*
>>
>> INFO[3054] GET /v1.15/images/mongo:2.6.8/json
>> INFO[3054] GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> ERRO[3054] Handler for GET
>> /v1.21/images/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> returned error: No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> ERRO[3054] HTTP Error
>> err=No such image:
>> mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> statusCode=404
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] POST
>> /v1.21/containers/create?name=mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b
>> INFO[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/attach?stderr=1=1=1
>> INFO[3054] POST
>> /v1.21/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/start
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET /v1.15/containers/weave/json
>> INFO[3054] GET
>> /v1.15/containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3054] GET
>> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/json
>> INFO[3111] GET /v1.21/containers/json
>> INFO[3120] GET /v1.21/containers/cf7/json
>> INFO[3120] GET
>> /v1.21/containers/cf7/logs?stderr=1=1=all
>> INFO[3153] GET /containers/json
>> INFO[3153] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3153] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> INFO[3153] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>> INFO[3175] GET /containers/json
>> INFO[3175] GET
>> /containers/cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47/json
>> INFO[3175] GET
>> /containers/56111722ef83134f6c73c5e3aa27de3f34f1fa73efdec3257c3cc9b283e40729/json
>> INFO[3175] GET
>> /containers/b9e9b79a8d431455bfcaafca59223017b2470a47a294075d656eeffdaaefad33/json
>> *INFO[3175] POST
>> /v1.21/containers/mesos-20160114-153418-1674208327-5050-3798-S0.c5c35d59-1318-4a96-b850-b0b788815f1b/stop*
>> ?t=15
>> *ERRO[3175] attach: stdout: write unix @: broken pipe*
>> *INFO[3190] Container
>> cf7fc7c483248e30f1dbb5990ce8874f2bfbe936c74eed1fc9af6f70653a1d47 failed to
>> exit within 15 seconds of SIGTERM - using the force *
>> *INFO[3200] Container cf7fc7c48324 failed to exit within 10
>> seconds of kill - trying direct SIGKILL *
>>
>> *STDOUT from Mesos:*
>>
>>